614 results for "Sun, Aixin"
Search Results
602. Data Mining as a Key Enabler of Computational Social Science
- Author
-
Srivastava, Jaideep, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Datta, Anwitaman, editor, Shulman, Stuart, editor, Zheng, Baihua, editor, Lin, Shou-De, editor, Sun, Aixin, editor, and Lim, Ee-Peng, editor
- Published
- 2011
- Full Text
- View/download PDF
603. Using Web Science to Understand and Enable 21st Century Multidimensional Networks
- Author
-
Contractor, Noshir, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Datta, Anwitaman, editor, Shulman, Stuart, editor, Zheng, Baihua, editor, Lin, Shou-De, editor, Sun, Aixin, editor, and Lim, Ee-Peng, editor
- Published
- 2011
- Full Text
- View/download PDF
604. Harnessing online social media to deal with information overload
- Author
-
Chenliang Li, Anwitaman Datta, Sun Aixin, School of Computer Engineering, and Centre for Advanced Information Systems
- Subjects
Computer science, Information needs, computer.software_genre, Data science, Information science, Information overload, World Wide Web, Management information systems, Named-entity recognition, Web page, Information system, Relevance (information retrieval), Engineering::Computer science and engineering::Information systems::Information systems applications [DRNTU], computer
- Abstract
In online social media, users become information creators and disseminators through active interplay between information items and other users, instead of being the mere information consumers of a decade ago. This collaborative and active mode of information production and dissemination further aggravates the problem of information overload on the World Wide Web (WWW). Existing approaches for information retrieval (IR) and natural language processing (NLP) tasks often incur an intolerable response time for Web users. Moreover, given the numerous interactions between users and information items, new kinds of information needs are emerging, such as opinion mining, event detection, and summarization. However, existing IR technologies (based on the bag-of-words model) and NLP technologies (based on linguistic features) often fail to satisfy Web users' emerging information needs. On the other hand, people participate in online social media to share stories and photos with their friends, vote and leave opinions, tag web pages, and so on. The digital footprints of these behaviors make online social media a semantic resource that we can exploit to better understand and organize this astronomical amount of information. In this dissertation, we first analyze online social media as a multi-dimensional social network, taking Wikipedia as a case study. We find that, given the multiple relations exposed from different perspectives in the network, focusing on only one specific relation could lead to biased or even wrong conclusions. Traditional information retrieval approaches are mainly bag-of-words and keyword based; they ignore word ordering in the text and measure relevance based on the presence of keywords. We propose a generalized framework for word sense disambiguation based on Wikipedia. The proposed framework enables effective and efficient disambiguation by relating keyphrases (i.e., n-grams) in documents to their appropriate concepts in Wikipedia, where a concept is defined as a Wikipedia article. The framework is applicable to documents in different languages under different settings. By adopting the disambiguation method, we can represent a textual document by the Wikipedia concepts it covers. We study the semantic tag recommendation task for web pages based on this concept model, exploring the semantic relations between tags and concepts underlying human annotation activities. Web users participate in the information generation process by commenting on news articles, sharing stories, and publishing opinions through microblog posts. However, the information generated by users is often short and written in a free style, containing grammatical errors and informal abbreviations (e.g., comments, tweets). These adverse features deteriorate the performance of existing algorithms on many tasks for online social media, such as named entity recognition and event detection. We propose an unsupervised approach for named entity recognition in targeted Twitter streams. Within this work, we develop a tweet segmentation algorithm that splits each tweet into non-overlapping phrases, called tweet segments. Building on the semantic units produced by tweet segmentation, we further propose an effective and scalable algorithm for event detection in tweets. DOCTOR OF PHILOSOPHY (SCE)
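A minimal sketch of the core idea of relating keyphrases to Wikipedia concepts, using an anchor-text dictionary and greedy longest-match linking. The `ANCHORS` dictionary, its contents, and the commonness scores are hypothetical stand-ins; the dissertation's framework is more general than this baseline.

```python
# Hypothetical anchor dictionary: surface form -> {concept: commonness score},
# where commonness approximates P(concept | anchor text) from Wikipedia links.
ANCHORS = {
    "apple": {"Apple_Inc.": 0.7, "Apple_(fruit)": 0.3},
    "big apple": {"New_York_City": 0.95},
}

def ngrams(tokens, max_n=3):
    """Yield (start index, n-gram string), longest n-grams first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])

def link_concepts(text):
    """Greedily map the longest matching keyphrases to their most common concept."""
    tokens = text.lower().split()
    used, links = set(), []
    for i, phrase in ngrams(tokens):
        span = set(range(i, i + len(phrase.split())))
        if phrase in ANCHORS and not span & used:
            concept = max(ANCHORS[phrase], key=ANCHORS[phrase].get)
            links.append((phrase, concept))
            used |= span
    return links

print(link_concepts("I ate an apple in the Big Apple"))
# [('big apple', 'New_York_City'), ('apple', 'Apple_Inc.')]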
- Published
- 2019
605. Product name recognition and normalization in internet forums
- Author
-
Yangjie Yao, Sun Aixin, and School of Computer Engineering
- Subjects
Engineering::Computer science and engineering [DRNTU], Conditional random field, Normalization (statistics), Multimedia, business.industry, Computer science, computer.software_genre, Named-entity recognition, Phone, Mobile phone, The Internet, Artificial intelligence, business, Precision and recall, computer, Natural language processing, Sentence
- Abstract
Collecting user feedback on products is a common practice for product providers to better understand consumers' concerns or requirements and to further improve their products or marketing strategies. Even though dedicated review sites (e.g., Epinions, Amazon, CNET reviews) offer a relatively straightforward approach, as user feedback about a specific product is usually well organized in a list, collecting user feedback from Internet forums is challenging. One reason is that user feedback about a product is often spread across different discussion threads in forums. More importantly, users often mention product names with a large number of name variations. On the other hand, Internet forums cover feedback from many more users, so user feedback covering more comprehensive aspects can be obtained. We propose a method named Gren to recognize and normalize mobile phone names in Internet forums. Instead of directly recognizing phone names from sentences as in most named entity recognition tasks, we propose an approach that generates candidate names as the first step. The candidate names capture short forms, spelling variations, and nicknames of products, but are not noise free. To predict whether a candidate name mention in a sentence indeed refers to a specific phone model, a CRF-based name recognizer is developed. The CRF (Conditional Random Field) model is trained using a large set of sentences obtained in a semi-automatic manner with minimal manual labeling effort. Lastly, a rule-based name normalization component maps a recognized name to its formal form. For evaluation, we randomly selected 20 threads related to 20 mobile phones from an Internet forum. Each thread contains about 100 post messages. We manually labeled the mobile phone name mentions in these posts and mapped the true mentions to their formal names. In total, about 4000 sentences, containing about 1000 phone name mentions, were manually labeled. Evaluated on the labeled data, Gren outperforms all baseline methods. Specifically, it achieves precision and recall of 0.918 and 0.875 respectively, with the best feature setting. Compared to Stanford NER, which is considered a strong baseline, a 134% improvement in recall is observed. We also provide a detailed analysis of the intermediate results obtained by each of the three components in Gren and observe that features from Brown clustering are the most effective; removing them results in the largest degradation in F1, from 0.896 to 0.804. Two implications for NER tasks follow from our observations. First, if candidate named entities can be pre-generated, a large number of training examples may be generated at very low manual annotation cost. Second, if we can segment the sentences and pre-generate the text chunks, we are able to rewrite the sentences; rewriting enables us to take the surrounding words of a candidate named entity as its context in a more natural manner. MASTER OF ENGINEERING (SCE)
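A minimal sketch of the candidate-name generation step described above: expanding a formal phone name into plausible short forms and spelling variations. The variation rules here are hypothetical illustrations, not the rules Gren actually uses.

```python
def candidate_names(formal_name):
    """Generate candidate surface forms for one formal product name."""
    tokens = formal_name.lower().split()          # e.g. "samsung galaxy s3"
    cands = {" ".join(tokens)}
    # Short forms: drop leading (brand) tokens, e.g. "galaxy s3", "s3".
    for i in range(1, len(tokens)):
        cands.add(" ".join(tokens[i:]))
    # Spelling variations: join or hyphenate tokens ("samsunggalaxys3", ...).
    cands.add("".join(tokens))
    cands.add("-".join(tokens))
    # Digit/roman-numeral swaps common in forums ("s iii" for "s3").
    roman = {"1": "i", "2": "ii", "3": "iii", "4": "iv", "5": "v"}
    for t in tokens:
        if t[-1] in roman:
            cands.add(" ".join(tok if tok != t else t[:-1] + " " + roman[t[-1]]
                               for tok in tokens))
    return cands

print(sorted(candidate_names("Samsung Galaxy S3")))
```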
- Published
- 2019
606. Algorithms for influence maximization and seed minimization
- Author
-
Yanchen Shi, Sun Aixin, and School of Computer Science and Engineering
- Subjects
Computer science and engineering::Information systems::Database management [Engineering], Computer science, Minification, Maximization, Algorithm
- Abstract
A graph is a basic mathematical tool that models entities and their complex relationships in various real-world problems. It has found important applications in social network analysis, route planning, telecommunications, etc. In recent years, the complexity and scale of real-world graphs have increased dramatically. In particular, international social networks can comprise hundreds of millions of users and up to billions of relationships. Thus, even algorithms with decent time or space complexity face challenges on large-scale networks for queries like influence maximization and seed minimization. In this thesis, we investigate these two problems on large-scale networks. Given a social network G, a probabilistic propagation model M, and a small number k > 0, the influence maximization problem aims to find k nodes that maximize the expected number of influenced nodes under the pre-defined model M. This problem is derived from viral marketing, where a company gives away free samples to a small number of influential individuals in order to create a cascade of adoption via the word-of-mouth effect. This study proposes a two-phase approach, Influence Maximization via Martingales (IMM), that offers both practical efficiency and theoretical guarantees. In particular, IMM returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^(−ℓ) probability in O((k + ℓ)(n + m) log n / ε²) running time. IMM is further extended to the triggering model and the continuous-time model. We experimentally evaluate IMM against state-of-the-art benchmarks under several diffusion models and parameter settings, using large networks with up to 1.4 billion edges. The experimental results show that our approach consistently outperforms the state of the art in terms of efficiency. The seed minimization problem is a variant of influence maximization with the same origin in advertising. Given a social network G and a coverage threshold t, the seed minimization problem aims to find a minimum-size seed set S whose expected number of influenced nodes is at least t·n. Compared to influence maximization, which maximizes influence under a given budget, seed minimization seeks to reduce the cost to the minimum number of seeds while keeping the influence above a predefined threshold. To solve the problem, we propose GSM, a greedy algorithm with a tight approximation guarantee, high generality, and easy implementation. In particular, it yields a ⌈(1 + ϕ)log(tn)⌉-approximate solution with at least 1 − n^(−ℓ) probability, where ℓ and ϕ are both tunable. We experimentally evaluate GSM in several settings of both t and β, and it is often orders of magnitude faster than the traditional greedy benchmark MINTSS. GSM also performs impressively on a large Twitter graph with more than a billion edges. Master of Engineering
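For orientation, here is a toy sketch of the classic greedy baseline for influence maximization, with spread estimated by Monte Carlo simulation under the independent cascade model. This is illustrative only and is not the IMM algorithm, whose efficiency and guarantees come from reverse-reachable-set sampling and martingale analysis; the graph, probability p, and run count below are arbitrary.

```python
import random

def simulate_ic(graph, seeds, p=0.1):
    """One independent-cascade run: each newly activated node activates each
    out-neighbor independently with probability p; return #activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def spread(graph, seeds, p=0.1, runs=200):
    """Monte Carlo estimate of the expected spread of a seed set."""
    return sum(simulate_ic(graph, seeds, p) for _ in range(runs)) / runs

def greedy_im(graph, k, p=0.1):
    """Pick k seeds, each maximizing the marginal gain in expected spread."""
    seeds = []
    for _ in range(k):
        best = max((v for v in graph if v not in seeds),
                   key=lambda v: spread(graph, seeds + [v], p))
        seeds.append(best)
    return seeds

toy = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(greedy_im(toy, k=2))
```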
- Published
- 2019
607. Event-driven data collection and summarization
- Author
-
Xin Zheng, Sun Aixin, and School of Computer Science and Engineering
- Subjects
Data collection, business.industry, Computer science, Event (relativity), Artificial intelligence, computer.software_genre, business, Automatic summarization, computer, Natural language processing, Engineering::Computer science and engineering::Computing methodologies::Document and text processing [DRNTU]
- Abstract
Online social networks have experienced unprecedented proliferation. Various platforms have changed the way people learn about information, particularly about ongoing events, which in the past could only be known from mainstream media. Social media platforms have many appealing properties compared with traditional media: they are convenient, detailed, fast, and interactive. This provides an opportunity to obtain immediate and detailed information about events of interest, which is highly valued by decision makers and the public, especially when emergencies and significant events happen. Social media platforms like Twitter provide keyword search for users to find messages containing the query keyword(s). However, the returned results are piecemeal due to length limitations, may be mixed with irrelevant tweets, or may be incomplete due to an inappropriate query. This calls for research on collecting clean and complete event-related messages from social media platforms. In this dissertation, the research is conducted on the Twitter platform. The collected event-relevant tweets can form a large set, and presenting the data in a concise and representative form helps end users get a general idea of the collected information. Therefore, after collecting event-related tweets, we aim to construct a summary of the large data set. Collecting clean and complete event-related tweets from the Twitter stream is not a trivial problem. The challenges are as follows: (i) the great volume of tweets makes the filtering process a heavy workload; (ii) tweets are short and noisy, with many abbreviations, dialects, and misspellings, which makes it difficult to identify event-related messages, especially when we do not have enough training data for the distinction; (iii) events evolve, and the collection should be adaptive to the development of events. The methods proposed in this dissertation address these challenges. As stated before, tweets are noisy and casually written, so it is not suitable to extract tweets directly as a summary. Therefore, we turn to well-written news articles linked by URLs in tweets, which should be on the same topic as the event-related tweets. We call tweets with URLs linking to news "linking tweets". The news reports the main information about the event of interest, while tweets can be diverse, including people's focuses, comments, and other information complementary to the news. We aim to construct a summary based on both the linked news articles and the event-related tweets. The summary should not only highlight the key points in the news articles but also address people's focuses regarding the event. People's focuses are usually ignored when summarizing only news articles, but they are important when presenting the core information to people who care about the event. To sum up, in this dissertation, we propose approaches for collecting clean and complete event-related tweets from the Twitter stream, and unsupervised models for summarizing single and multiple news documents with linking tweets. Doctor of Philosophy
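A minimal sketch of the general idea of adaptive event-driven collection: keep tweets matching the current query terms, and periodically promote frequently co-occurring terms into the query so the collection can follow an evolving event. The thresholds and the update rule are hypothetical choices for illustration, not the dissertation's method.

```python
from collections import Counter

def collect(stream, query, update_every=1000, top_new=2):
    """Filter an iterable of tweet strings by an adaptively expanded keyword set."""
    query = set(query)
    kept, cooc = [], Counter()
    for i, tweet in enumerate(stream, 1):
        tokens = set(tweet.lower().split())
        if tokens & query:
            kept.append(tweet)
            cooc.update(tokens - query)        # candidate expansion terms
        if i % update_every == 0 and cooc:
            for term, _ in cooc.most_common(top_new):
                query.add(term)                # adapt to the event's development
            cooc.clear()
    return kept, query
```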
- Published
- 2018
608. Supporting information needs of developers through web Q&A discussions
- Author
-
Li, Jing, Sun Aixin, and School of Computer Science and Engineering
- Subjects
Engineering::Computer science and engineering [DRNTU]
- Abstract
Programming is evolving because of the prevalence of the Web. Nowadays, it is a common activity for developers to search the Web for information to solve the problems they encounter while working on software development tasks. Existing studies have investigated developers' information needs on the Web via qualitative analysis and questionnaire surveys, but little is known about developers' micro-level information behaviors and needs on the Web during software development. For example, how often do developers refine existing queries and/or create new queries? And how many web pages are opened after a search? To fill this gap, we conducted an empirical study to investigate how developers seek and use web resources at the micro level. The empirical study revealed three key insights: first, developers might have an incomplete or even incorrect understanding of their needs; second, there is a gap between the producers and consumers of software documentation; third, many important pieces of information that developers need are not explicitly documented in software documentation. These insights motivated further studies of supporting developers' information needs. More specifically, the contributions of this thesis are: (1) Understanding information needs of developers: We developed a video scraping tool to automatically extract developers' behavioral data from task videos. We conducted a micro-level quantitative analysis of developers' information behaviors, including patterns of keyword sources, keyword refinement, web pages visited, context switching, and information flow. The outcomes of this micro-level quantitative analysis provided three important insights for supporting developers' information needs. (2) Discovering learning resources: To bridge the information gap in the first insight, we developed the LinkLive technique to recommend correlated learning resources even when developers have only a vague understanding of their needs. LinkLive uses multiple features, including hyperlink co-occurrences in web Q&A discussions, the locations (e.g., question, answer, or comment) in which hyperlinks are referenced, and the votes for posts/comments in which hyperlinks are referenced. A large-scale evaluation shows that our technique recommends correlated web resources with satisfactory precision and recall in an open setting. (3) Answering programming questions: To bridge the information gap in the second insight, we proposed a novel deep-learning-to-answer framework, named QDLinker, for answering programming questions with software documentation. QDLinker leverages the large volume of discussions in Community-based Question Answering (CQA) to bridge the semantic gap between programmers' questions and software documentation. Through extensive experiments, we show that QDLinker significantly outperforms baselines based on traditional retrieval models and Web search services dedicated to software documentation. (4) Distilling crowdsourced negative caveats: To bridge the information gap in the third insight, we proposed DISCA, a novel approach to automatically distilling desirable Application Programming Interface (API) negative caveats from unstructured web Q&A discussions. The quantitative and qualitative evaluations show that DISCA can greatly augment official API documentation. Doctor of Philosophy (SCE)
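A minimal sketch of just the hyperlink co-occurrence signal that LinkLive is described as using: URLs referenced together in the same Q&A thread are treated as correlated, and resources are recommended by co-occurrence count. LinkLive itself combines this with location and vote features; the thread data below is hypothetical.

```python
from collections import Counter
from itertools import combinations

threads = [  # hypothetical: URLs extracted from each Q&A discussion
    ["docs.oracle.com/javase", "stackoverflow.com/q/1", "baeldung.com/java-streams"],
    ["docs.oracle.com/javase", "baeldung.com/java-streams"],
    ["kotlinlang.org/docs", "stackoverflow.com/q/2"],
]

# Count how often each pair of URLs appears in the same thread.
cooc = Counter()
for urls in threads:
    for a, b in combinations(sorted(set(urls)), 2):
        cooc[(a, b)] += 1

def recommend(url, k=3):
    """Rank resources that most often co-occur with `url` across threads."""
    scores = Counter()
    for (a, b), c in cooc.items():
        if a == url:
            scores[b] += c
        elif b == url:
            scores[a] += c
    return scores.most_common(k)

print(recommend("docs.oracle.com/javase"))
```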
- Published
- 2018
609. TaCLe - Learning Constraints in Tabular Data
- Author
-
Luc De Raedt, Samuel Kolb, Sergey Paramonov, Tias Guns, Vrije Universiteit Brussel, Business technology and Operations, Electromobility research centre, Lim, Ee-Peng, Winslett, Marianne, Sanderson, Mark, Fu, Ada Wai-Chee, Sun, Jimeng, Culpepper, J Shane, Lo, Eric, Ho, Joyce C, Donato, Debora, Agrawal, Rakesh, Zheng, Yu, Castillo, Carlos, Sun, Aixin, Tseng, Vincent S, and Li, Chenliang
- Subjects
Decision Sciences(all), Constraint learning, Theoretical computer science, Computer science, Spreadsheets, Business, Management and Accounting(all), Statistical relational learning, 020207 software engineering, 02 engineering and technology, Table (information), Row and column spaces, Constraint Learning, Constraint (information theory), 0202 electrical engineering, electronic engineering, information engineering, Relational Learning, 020201 artificial intelligence & image processing, User interface
- Abstract
Spreadsheet data is widely used today by many different people and across industries. However, writing, maintaining, and identifying good formulae for spreadsheets can be time consuming and error-prone. To address this issue we have introduced the TaCLe system (Tabular Constraint Learner). The system tackles an inverse learning problem: given a plain comma-separated file, it reconstructs the spreadsheet formulae that hold in the tables. Two important considerations are the number of cells and constraints to check, and how to deal with multiple formulae for the same cell. Our system reasons over entire rows and columns and has an intuitive user interface for interacting with the learned constraints and data. It can be seen as an intelligent assistance tool for discovering formulae from data. Funding: FWO; ERC-ADG-201 project 694980 SYNTH, funded by the European Research Council. Published in: Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM 2017), Singapore, 6-10 Nov 2017, pages 2511-2514.
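A minimal sketch of the inverse problem TaCLe tackles: given plain tabular data, test whole columns against candidate formula templates and keep the constraints that hold. Only two hypothetical templates (pairwise column sum and row-wise product) are checked here; TaCLe's actual template library, search strategy, and interface are far richer.

```python
from itertools import combinations

table = {  # hypothetical CSV columns
    "qty":   [2, 5, 3],
    "price": [10, 4, 6],
    "total": [20, 20, 18],
}

def holds_product(x, y, z):
    """Does z = x * y hold for every row?"""
    return all(a * b == c for a, b, c in zip(table[x], table[y], table[z]))

def holds_sum(xs, z):
    """Does z = sum(xs) hold for every row?"""
    return all(sum(vals) == t
               for *vals, t in zip(*(table[x] for x in xs), table[z]))

learned = []
cols = list(table)
for z in cols:
    for x, y in combinations([c for c in cols if c != z], 2):
        if holds_product(x, y, z):
            learned.append(f"{z} = {x} * {y}")
        if holds_sum([x, y], z):
            learned.append(f"{z} = {x} + {y}")

print(learned)  # ['total = qty * price']
```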
- Published
- 2017
- Full Text
- View/download PDF
610. Spam analysis and detection on microblog
- Author
-
Sedhai Surendra, Sun Aixin, and School of Computer Science and Engineering
- Subjects
Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences [DRNTU]
- Abstract
Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Due to the ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. The increasing activities of spammers on micro-blogging platforms make spam a serious problem that also affects user experience. Due to the quick propagation of content in microblogging services, it is highly desirable to detect spam in real time to minimize its impact. Hence, a real-time spam detection technique that leverages fine-grained information is essential. As Twitter is one of the most popular micro-blogging platforms, in this dissertation we use a Twitter dataset to study spam on microblogs. Spam detection is an active area of research on Twitter. Most studies focus on identifying spammers, and spam issues on Twitter are tackled by blocking the spammers detected by the system. However, a user account that mistakenly grants permission to a malicious third-party application may get blocked due to the posts made by the application on behalf of the user. Similarly, compromised microblog accounts may get blocked because of the tweets posted by hijackers. Further, spammers are generally detected only after posting many spam tweets. To address the issue of spam on Twitter, we argue that tweet-level spam detection is necessary. Tweet-level spam detection is inherently challenging, as tweets are short and noisy texts. Spam tweets heavily exploit hashtags to promote the tweets to a wider audience. Hence, we focus on hashtag-oriented spam detection by collecting tweets using trending hashtags as queries. Further, we propose an effective way of labeling tweets to generate a dataset for this task. To the best of our knowledge, there was no benchmark dataset; hence we present HSpam14, a public dataset consisting of 14 million tweets labeled with the proposed labeling technique. We conduct a detailed tweet-level analysis based on hashtags and tweet content, and a user-level analysis based on user profiles. This detailed understanding of spam tweets and legitimate tweets, also known as ham tweets, is utilized to design a spam tweet detection system. Unlabeled tweets are easy to obtain, and they can be utilized to improve system performance and to deal with evolving spamming activities. Hence, we propose a semi-supervised real-time spam detection system to effectively identify spam tweets. Spam in a microblog introduces problems in several functionalities such as search, recommendation, and text analysis. As a case study, we analyze the effect of spam tweets on hashtag recommendation using the HSpam14 dataset. We observe that features and methods that are effective on a spam tweet collection may not be effective on legitimate tweets. Our study shows that experiments conducted on a spammy dataset give misleading results; hence, it is crucial to perform spam filtering before conducting any analysis on Twitter. In a nutshell, this dissertation elucidates the effectiveness of tweet-level spam detection and different aspects of spam and ham tweets, and also paves the way for further research on fine-grained spam detection on microblogs. Doctor of Philosophy (SCE)
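A minimal sketch of a supervised tweet-level spam classifier: TF-IDF features over tweet text plus a linear model. This is a generic baseline for illustration, not the semi-supervised system the dissertation proposes, and the labeled tweets below are hypothetical stand-ins for a dataset such as HSpam14.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "FREE iphone!!! click here #win #free #iphone #luck #rt",
    "win $$$ now follow and retweet #free #money #win",
    "great turnout at the #CIKM2017 keynote this morning",
    "our paper on spam detection is now online #research",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Word and bigram TF-IDF features feeding a logistic regression classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(),
)
clf.fit(tweets, labels)
print(clf.predict(["click to WIN a free prize #win #free"]))  # likely [1]
```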
- Published
- 2017
611. Knowledge discovery from forum data
- Author
-
Li, Jun, Sun Aixin, and Wee Kim Wee School of Communication and Information
- Subjects
Engineering::Computer science and engineering::Information systems::Information storage and retrieval [DRNTU]
- Abstract
Advancement in information retrieval and data mining techniques has provided more and more useful mechanisms for retrieving the most relevant information from documents, as well as for knowledge discovery from the same. The knowledge embedded in online forums, a knowledge-rich data source, has yet to be fully utilized because of the limited search functionalities provided by most existing forum platforms. This project provides a prototype solution to improve the search functions of online forums. More specifically, a multithreaded Crawler and a Parser have been implemented to download and parse the posts published in a local forum in HTML format. A Topic Modeler built on the MALLET package is used to generate the high-level topics of the forum data. An Indexer and a Searcher are then developed based on Lucene to support searching over the forum data. A web search interface that supports sophisticated search requests and faceted visualization of search results is developed for users to discover knowledge in online forums. As a result, the solution provided by this project allows users to search for relevant information using simple (e.g., single-keyword) as well as sophisticated queries. It also shows users a high-level view of the search results in an aggregated, multi-facet visualized form. Furthermore, it enables users to understand the high-level topics of the search results through topic modeling. This search interface helps users find relevant information more effectively and efficiently. The study ends with a few limitations identified but not tackled due to the project scope and time constraints; nevertheless, recommendations on addressing these limitations are made as future work. Master of Science (Information Studies)
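A toy sketch of the index-and-search core of such a prototype: an in-memory inverted index over forum posts with AND-semantics keyword queries. The real system described above builds on Lucene for indexing/search and MALLET for topic modeling; the posts below are hypothetical.

```python
from collections import defaultdict

posts = {  # hypothetical forum posts: id -> text
    1: "how to reset my router password",
    2: "router keeps dropping wifi connection",
    3: "best password manager recommendations",
}

# Build the inverted index: term -> set of post ids containing it.
index = defaultdict(set)
for pid, text in posts.items():
    for term in text.lower().split():
        index[term].add(pid)

def search(query):
    """AND-semantics keyword search: posts containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index[terms[0]].copy()
    for t in terms[1:]:
        result &= index[t]
    return result

print(search("router password"))  # {1}
```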
- Published
- 2015
612. Ranking user generated content using topic models
- Author
-
Zongyang Ma, Sun Aixin, School of Computer Engineering, and Centre for Advanced Information Systems
- Subjects
Topic model, Information retrieval, Probabilistic latent semantic analysis, Computer science, Question answering, User-generated content, Relevance (information retrieval), Information overload, Ranking (information retrieval), Adversarial information retrieval, Engineering::Computer science and engineering::Information systems::Information storage and retrieval [DRNTU]
- Abstract
With the popularity of Web 2.0, more and more users express and share opinions through various online platforms. Example platforms include news websites that support user commenting like Yahoo! News, social network sites that allow users to post messages like Facebook and Twitter, and community-based question answering sites that let users ask and answer questions. As a result, a huge amount of User Generated Content (UGC) is accumulated online in the form of comments, tweets, question and answer posts, and others. Depending on the platform within which UGC is created, it may be associated with different types of attributes such as creator, time, location, text, and the social connections of its creator. On the other hand, UGC data from different platforms shares similar characteristics: huge volume, free writing style, and heterogeneous nature. More importantly, UGC data often demonstrates a master-slave relationship: a comment is associated with a news article; a hashtag is an annotation of the tweet that embeds it; an answer does not exist without a question. Here, news articles, tweets, and questions are master documents, while comments, hashtags, and answers are slave documents. Although topic modeling (e.g., LDA and PLSA) has been widely used to model text collections, discovering fine-grained topics from UGC while taking the master-slave relationship into consideration remains an open and challenging problem. In this research, the generative process of UGC data is simulated using topic models for ranking the slave documents of given master documents, with the aim of reducing information overload. Depending on the platform in which the UGC data is created, three sub-problems are defined and addressed: (i) comment ranking for news articles, (ii) hashtag ranking for tweets, and (iii) answer ranking for questions. Comment ranking is essential for identifying the important comments as a summary of user discussion of a news article. In this task, we assume that the topics of slave documents cover the topics of their corresponding master document, as well as topics discussed solely in the comments. For this problem, we propose two LDA-style topic models, namely, the Master-Slave Topic Model (MSTM) and the Extended Master-Slave Topic Model (EXTM). MSTM constrains the topics discussed in comments to be derived from the commented news article. EXTM allows the words of comments to be generated both from topics derived from the commented news article and from topics derived from the comments themselves. Evaluated on Yahoo! News, the proposed models outperform baseline methods. Hashtag ranking is important for tweet annotation and retrieval. Here, we assume that the topics of slave documents are the topical summary of their corresponding master documents. For this problem, we propose two PLSA-style topic models to model hashtag annotation behavior. The Content-Pivoted Model (CPM) assumes that tweet content guides the generation of hashtags, while the Hashtag-Pivoted Model (HPM) assumes that hashtags guide the generation of tweet content. The experimental results demonstrate that CPM is the most effective for ranking the most relevant hashtags of tweets. Answer ranking enables users to easily pick the best answers to questions. In this task, we assume that the topics of slave documents and the topics of their corresponding master documents are similar, but the words of slave topics and master topics are drawn from different vocabularies. For this problem, we propose a PLSA-style topic model, namely the Tri-Role Topic Model (TRTM), to model the tri-roles of users (i.e., as askers, answerers, and voters, respectively) and the activities of each role, including composing questions, selecting questions to answer, and contributing and voting on answers. Evaluated on Stack Overflow data, TRTM outperforms state-of-the-art methods for ranking high-quality answers to given questions. These three problems all concern ranking UGC data from different platforms using topic models, and the proposed topic models are extended according to the master-slave structure of the UGC data. For the problem of comment ranking, the slave documents (comments) are much shorter than their corresponding master document (news article); our main concern is discovering topics from comments that reflect the topics of their news article, while keeping topics discussed only among the comments. For the problem of hashtag ranking, the slave documents (hashtags) are extremely short, and sometimes a hashtag is just the abbreviation of one or a few words; compared with comment ranking, hashtag ranking is more difficult, and we thus introduce more factors (e.g., user and time) to enrich the hashtag representation. Lastly, for the problem of answer ranking, an answer has an important additional feature: its votes. It is challenging to model the voting behavior of users in a generative model; to address this, we focus on modeling the relationships between questions, answers, askers, and answerers using the exponential KL-divergence function. In this research, we define three ranking problems over User Generated Content. To address these problems, we propose several extended topic models that fit the characteristics and the structure of UGC data from different platforms. From Yahoo! News to Twitter, then to Stack Overflow, the features of the adopted data become more and more complicated, and the designed topic models include more features and relationships to more accurately simulate the generation process of UGC data. Experimental results show that our methods outperform baseline methods for all three problems. DOCTOR OF PHILOSOPHY (SCE)
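A minimal sketch of the master-slave ranking idea using an off-the-shelf topic model: fit LDA over the article plus its comments, then rank comments by the similarity between their topic distribution and the article's. This generic stand-in illustrates the concept only; it is not the MSTM/EXTM models proposed in the thesis, and the article and comments are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

article = "government announces new climate policy to cut carbon emissions"
comments = [  # hypothetical slave documents
    "the carbon policy will raise energy prices",
    "my cat did something funny today",
    "emissions cuts are long overdue for the climate",
]

# Fit a small LDA model over master and slave documents together.
vec = CountVectorizer()
X = vec.fit_transform([article] + comments)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
theta = lda.transform(X)  # per-document topic distributions

# Rank comments by topical similarity (dot product) to the article.
article_topics, comment_topics = theta[0], theta[1:]
scores = comment_topics @ article_topics
for rank in np.argsort(-scores):
    print(round(scores[rank], 3), comments[rank])
```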
- Published
- 2015
613. A study of geographical neighborhood influence to business rating prediction
- Author
-
Longke Hu, Sun Aixin, and School of Computer Engineering
- Subjects
Information retrieval, Information engineering, Computer science, Information system, Data science, Engineering::Computer science and engineering::Information systems::Information storage and retrieval [DRNTU]
- Abstract
Rating prediction is to predict the preference rating of a user for an item that she has not rated before, and it is one of the most popular and fundamental problems in recommendation systems. Using business review data from Yelp, we study the problem of business rating prediction in this thesis. A business here can be a restaurant, a shopping mall, a nightlife club, or another kind of business. Different from most other types of items that have been studied in various recommender systems (e.g., movies, songs, books), a business on Yelp physically exists at a geographical location, and most businesses have geographical neighbors within walking distance. When a user visits a business, there is a good chance that she walks by its neighbors. Through data analysis on Yelp, we find that there exists a weak positive correlation between a business's ratings and its neighbors' ratings, and that this positive correlation is independent of the categories of the businesses and/or their neighbors. Based on this observation, we assume that a user's rating of a given business is determined by both the intrinsic characteristics of the business and the extrinsic characteristics of its geographical neighbors. Using the widely adopted latent factor model for rating prediction, our proposed solution uses two kinds of latent factors to model a business: one for its intrinsic characteristics and the other for its extrinsic characteristics. More specifically, the former encodes the intrinsic characteristics of a business (e.g., taste of food and quality of service) observable by users who have interacted with the business. The latter encodes the extrinsic characteristics of a business (e.g., hygiene standard) that influence its geographical neighbors and are observable by "pass-by" visitors. We conduct extensive experiments on the Yelp dataset to evaluate the proposed models and compare them with state-of-the-art baseline methods. We show that by incorporating geographical neighborhood influences, much lower prediction error is achieved than with baseline models including Biased MF, SVD++, and Social MF. The prediction error is further reduced by incorporating influences from business category and review content. MASTER OF ENGINEERING (SCE)
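A minimal sketch of the modeling idea, assuming a simple additive form: predict a user's rating of a business from the user's latent factors, the business's intrinsic factors, and the averaged extrinsic factors of its geographical neighbors. The dimensions, the aggregation rule, the SGD step, and all data below are hypothetical; the thesis's models and training procedure differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_biz, d = 4, 3, 5
U = rng.normal(scale=0.1, size=(n_users, d))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_biz, d))     # intrinsic business factors
E = rng.normal(scale=0.1, size=(n_biz, d))     # extrinsic (neighbor-facing) factors
neighbors = {0: [1], 1: [0, 2], 2: [1]}        # geographical neighbor lists
mu = 3.5                                       # global rating mean

def predict(u, b, alpha=0.5):
    """Rating = mean + user . (intrinsic + alpha * mean of neighbors' extrinsic)."""
    nb = neighbors[b]
    nbr = E[nb].mean(axis=0) if nb else np.zeros(d)
    return mu + U[u] @ (Q[b] + alpha * nbr)

def sgd_step(u, b, r, lr=0.01, alpha=0.5):
    """One SGD step on an observed rating (u, b, r), descending squared error."""
    err = predict(u, b, alpha) - r
    nb = neighbors[b]
    grad_item = Q[b] + alpha * E[nb].mean(axis=0)
    U[u] -= lr * err * grad_item
    Q[b] -= lr * err * U[u]
    for j in nb:
        E[j] -= lr * err * alpha * U[u] / len(nb)

sgd_step(0, 1, r=4.0)
print(round(predict(0, 1), 3))
```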
- Published
- 2014
614. Mining user-created content for document summarization and event detection
- Author
-
Hu, Meishan, Sun Aixin, School of Computer Engineering, and Centre for Advanced Information Systems
- Subjects
Engineering::Computer science and engineering::Information systems [DRNTU]
- Abstract
Empowered with the ability to create content using advanced Web services and easy-to-publish tools, today's Web users are creating content and contributing knowledge through various Web activities. As a result, the Web is abundant with user-created content. With the aim of deriving collective intelligence and wisdom-of-the-crowd, we conducted research on knowledge mining from user-created content. Our research focused on three forms of user-created content: comments, blogs, and search queries. As one of the important features of blogs, comments written by readers are believed to represent readers' feedback on documents. From our user study on blog reading, we found that human summarizers selected significantly different sets of sentences from blog posts before and after reading comments. Hence, we proposed and studied the problem of comments-oriented document summarization, whose goal is to extract a subset of sentences from a given document that best reflects the topics not only presented in the document but also discussed among the associated comments. To generate comments-oriented summaries, we proposed and evaluated a number of methods under two separate approaches. In the feature-scoring approach, we view words as the features that bridge the semantics of a document and its associated comments, and score sentences according to the words they contain. As the important containers of words, the set of comments is scored through either a graph-based or a tensor-based scoring method, based on three relations (i.e., topic, quotation, and mention) identified among comments. In the language-modeling approach, we view the desire for a summary as an information need and estimate a language model of the comments-oriented summary from the document language model and the comments language model. Sentences are then ranked through either Odds Ratio selection or Negative Kullback-Leibler Divergence selection. Doctor of Philosophy
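A minimal sketch in the spirit of the language-modeling approach described above: estimate a summary language model from the document and comment models, then rank sentences by negative KL divergence from it. The interpolation weight, the additive smoothing, and the toy texts are hypothetical choices, not the thesis's settings.

```python
import math
from collections import Counter

def lm(tokens, vocab, eps=1e-6):
    """Unigram language model with additive smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

document = "the new phone has a great camera and battery".split()
comments = "battery life is great the battery lasts days".split()
sentences = ["the new phone has a great camera and battery",
             "many people were in the store"]

vocab = set(document) | set(comments) | {w for s in sentences for w in s.split()}
p_doc, p_com = lm(document, vocab), lm(comments, vocab)
# Summary LM: interpolate document and comment models (weight is a free choice).
p_sum = {w: 0.5 * p_doc[w] + 0.5 * p_com[w] for w in vocab}

def neg_kl(sentence):
    """Score a sentence by -KL(summary model || sentence model)."""
    p_s = lm(sentence.split(), vocab)
    return -sum(p_sum[w] * math.log(p_sum[w] / p_s[w]) for w in vocab)

for s in sorted(sentences, key=neg_kl, reverse=True):
    print(round(neg_kl(s), 3), s)
```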
- Published
- 2011