102 results on '"Sun, Aixin"'
Search Results
2. Natural language processing in urology: Automated extraction of clinical information from histopathology reports of uro-oncology procedures
- Author
- Huang, Honghong, Lim, Fiona Xin Yi, Gu, Gary Tianyu, Han, Matthew Jiangchou, Fang, Andrew Hao Sen, Chia, Elian Hui San, Bei, Eileen Yen Tze, Tham, Sarah Zhuling, Ho, Henry Sun Sien, Yuen, John Shyi Peng, Sun, Aixin, and Lim, Jay Kheng Sit
- Published
- 2023
- Full Text
- View/download PDF
3. Asking Clarifying Questions: To benefit or to disturb users in Web search?
- Author
- Zou, Jie, Sun, Aixin, Long, Cheng, Aliannejadi, Mohammad, and Kanoulas, Evangelos
- Published
- 2023
- Full Text
- View/download PDF
4. The crowd in MOOCs: a study of learning patterns at scale.
- Author
- Zhou, Xin, Sun, Aixin, Zhang, Jie, and Lin, Donghui
- Subjects
- *MASSIVE open online courses, *SEQUENTIAL pattern mining, *RECOMMENDER systems, *COSINE function, *PERIODIC functions
- Abstract
The increasing availability of learning activity data in Massive Open Online Courses (MOOCs) enables us to conduct a large-scale analysis of learners' learning behavior. In this paper, we analyze a dataset of 351 million learning activities from 0.8 million unique learners enrolled in over 1.6 thousand courses within two years. Specifically, we mine and identify the learning patterns of the crowd from both temporal and course enrollment perspectives leveraging mutual information theory and sequential pattern mining methods. From the temporal perspective, we find that the time intervals between consecutive learning activities of learners exhibit a mix of power-law and periodic cosine function distribution. By qualifying the relationship between course pairs, we observe that the most frequently co-enrolled courses usually fall in the same category or the same university. We demonstrate these findings can facilitate manifold applications including recommendation tasks on courses. A simple recommendation model utilizing the course enrollment patterns is competitive to the baselines with 200× faster training time. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. DP-GCN: Node Classification by Connectivity and Local Topology Structure on Real-World Network.
- Author
- Chen, Zhe and Sun, Aixin
- Subjects
- TOPOLOGY, CLASSIFICATION, PAYMENT, ROUTING algorithms, SUPPLIERS
- Abstract
Node classification is to predict the class label of a node by analyzing its properties and interactions in a network. We note that many existing solutions for graph-based node classification only consider node connectivity but not the node's local topology structure. However, nodes residing in different parts of a real-world network may share similar local topology structures. For example, local topology structures in a payment network may reveal sellers' business roles (e.g., supplier or retailer). To model both connectivity and local topology structure for better node classification performance, we present DP-GCN, a dual-path graph convolution network. DP-GCN consists of three main modules: (i) a C-GCN module to capture the connectivity relationships between nodes, (ii) a T-GCN module to capture the topology structure similarity among nodes, and (iii) a multi-head self-attention module to align both properties. We evaluate DP-GCN on seven benchmark datasets against diverse baselines to demonstrate its effectiveness. We also provide a case study of running DP-GCN on three large-scale payment networks from PayPal, a leading payment service provider, for risky seller detection. Experimental results show DP-GCN's effectiveness and practicability in large-scale settings. PayPal's internal testing also shows DP-GCN's effectiveness in defending against real risks from transaction networks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. DiffuRec: A Diffusion Model for Sequential Recommendation.
- Author
- Li, Zihao, Sun, Aixin, and Li, Chenliang
- Abstract
The article focuses on DiffuRec, a diffusion model for sequential recommendation, proposing a new generative paradigm to represent items' latent aspects and users' diverse preferences. Topics include the limitations of fixed vectors in capturing latent aspects, the adaptability of diffusion models in representing distributions for item representations, and the injection of uncertainty for more robust item embedding learning.
- Published
- 2024
- Full Text
- View/download PDF
7. The geography of corporate fake news.
- Author
- Darendeli, Alper, Sun, Aixin, and Tay, Wee Peng
- Subjects
- *FAKE news, *FOREIGN news, *INTERVENTION (International law), *FOREIGN investments, *CAPITAL market, *NEWS websites, *GEOGRAPHY, *COUNTRIES
- Abstract
Although a rich academic literature examines the use of fake news by foreign actors for political manipulation, there is limited research on potential foreign intervention in capital markets. To address this gap, we construct a comprehensive database of (negative) fake news regarding U.S. firms by scraping prominent fact-checking sites. We identify the accounts that spread the news on Twitter (now X) and use machine-learning techniques to infer the geographic locations of these fake news spreaders. Our analysis reveals that corporate fake news is more likely than corporate non-fake news to be spread by foreign accounts. At the country level, corporate fake news is more likely to originate from African and Middle Eastern countries and tends to increase during periods of high geopolitical tension. At the firm level, firms operating in uncertain information environments and strategic industries are more likely to be targeted by foreign accounts. Overall, our findings provide initial evidence of foreign-originating misinformation in capital markets and thus have important policy implications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. Efficient approximation algorithms for adaptive influence maximization
- Author
- Huang, Keke, Tang, Jing, Han, Kai, Xiao, Xiaokui, Chen, Wei, Sun, Aixin, Tang, Xueyan, and Lim, Andrew
- Published
- 2020
- Full Text
- View/download PDF
9. NEXT: a neural network framework for next POI recommendation
- Author
- Zhang, Zhiqian, Li, Chenliang, Wu, Zhiyong, Sun, Aixin, Ye, Dengpan, and Luo, Xiangyang
- Published
- 2020
- Full Text
- View/download PDF
10. Learning to answer programming questions with software documentation through social context embedding
- Author
- Li, Jing, Sun, Aixin, and Xing, Zhenchang
- Published
- 2018
- Full Text
- View/download PDF
11. LinkLive: discovering Web learning resources for developers from Q&A discussions
- Author
- Li, Jing, Xing, Zhenchang, and Sun, Aixin
- Published
- 2019
- Full Text
- View/download PDF
12. Dataset versus reality: Understanding model performance from the perspective of information need.
- Author
- Yu, Mengying and Sun, Aixin
- Subjects
- *STATISTICAL models, *TASK performance, *DATA curation, *INFORMATION needs, *INFORMATION retrieval, *DEEP learning
- Abstract
Deep learning technologies have brought us many models that outperform human beings on a few benchmarks. An interesting question is: can these models well solve real‐world problems with similar settings (e.g., identical input/output) to the benchmark datasets? We argue that a model is trained to answer the same information need in a similar context (e.g., the information available), for which the training dataset is created. The trained model may be used to solve real‐world problems for a similar information need in a similar context. However, information need is independent of the format of dataset input/output. Although some datasets may share high structural similarities, they may represent different research tasks aiming for answering different information needs. Examples are question–answer pairs for the question answering (QA) task, and image‐caption pairs for the image captioning (IC) task. In this paper, we use the QA task and IC task as two case studies and compare their widely used benchmark datasets. From the perspective of information need in the context of information retrieval, we show the differences in the dataset creation processes and the differences in morphosyntactic properties between datasets. The differences in these datasets can be attributed to the different information needs and contexts of the specific research tasks. We encourage all researchers to consider the information need perspective of a research task when selecting the appropriate datasets to train a model. Likewise, while creating a dataset, researchers may also incorporate the information need perspective as a factor to determine the degree to which the dataset accurately reflects the real‐world problem or the research task they intend to tackle. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. A time-aware trajectory embedding model for next-location recommendation
- Author
- Zhao, Wayne Xin, Zhou, Ningnan, Sun, Aixin, Wen, Ji-Rong, Han, Jialong, and Chang, Edward Y.
- Published
- 2018
- Full Text
- View/download PDF
14. Exploring prestigious citations sourced from top universities in bibliometrics and altmetrics: a case study in the computer science discipline
- Author
- Luo, Feiheng, Sun, Aixin, Erdt, Mojisola, Sesagiri Raamkumar, Aravind, and Theng, Yin-Leng
- Published
- 2017
- Full Text
- View/download PDF
15. Searching for the internet of things: where it is and what it looks like
- Author
- Shemshadi, Ali, Sheng, Quan Z., Qin, Yongrui, Sun, Aixin, Zhang, Wei Emma, and Yao, Lina
- Published
- 2017
- Full Text
- View/download PDF
16. Mobile phone name extraction from internet forums: a semi-supervised approach
- Author
- Yao, Yangjie and Sun, Aixin
- Published
- 2016
- Full Text
- View/download PDF
17. Conformity-aware influence maximization in online social networks
- Author
- Li, Hui, Bhowmick, Sourav S., Sun, Aixin, and Cui, Jiangtao
- Published
- 2015
- Full Text
- View/download PDF
18. Performance Measurement Framework for Hierarchical Text Classification.
- Author
- Sun, Aixin, Lim, Ee-Peng, and Ng, Wee-Keong
- Abstract
Discusses hierarchical text classification for electronic information retrieval and the measures used to evaluate performance. Proposes new performance measures that consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents, and explains a blocking measure that identifies non-performing classifiers. (Author/LRW)
- Published
- 2003
19. Web classification of conceptual entities using co-training
- Author
- Sun, Aixin, Liu, Ying, and Lim, Ee-Peng
- Published
- 2011
- Full Text
- View/download PDF
20. Detecting spam blogs from blog search results
- Author
- Zhu, Linhong, Sun, Aixin, and Choi, Byron
- Published
- 2011
- Full Text
- View/download PDF
21. Discovery of concept entities from web sites using web unit mining
- Author
- Yin Ming, Ming, Hoe‐lian Goh, Dion, Lim, Ee‐Peng, and Sun, Aixin
- Published
- 2005
- Full Text
- View/download PDF
22. Affinity-driven blog cascade analysis and prediction
- Author
- Li, Hui, Bhowmick, Sourav S, Sun, Aixin, and Cui, Jiangtao
- Published
- 2014
- Full Text
- View/download PDF
23. Point-of-Interest Recommendation With Global and Local Context.
- Author
- Han, Peng, Shang, Shuo, Sun, Aixin, Zhao, Peilin, Zheng, Kai, and Zhang, Xiangliang
- Subjects
- MATRIX decomposition, DATA distribution, LINEAR programming
- Abstract
The task of point of interest (POI) recommendation aims to recommend unvisited places to users based on their check-in history. A major challenge in POI recommendation is data sparsity, because a user typically visits only a very small number of POIs among all available POIs. In this paper, we propose AUC-MF to address the POI recommendation problem by maximizing Area Under the ROC curve (AUC). AUC has been widely used for measuring classification performance with imbalanced data distributions. To optimize AUC, we transform the recommendation task to a classification problem, where the visited locations are positive examples and the unvisited are negative ones. We define a new lambda for AUC to utilize the LambdaMF model, which combines the lambda-based method and matrix factorization model in collaborative filtering. Many studies have shown that geographic information plays an important role in POI recommendation. In this study, we focus on two levels of geographic information: local similarity and global similarity. We further show that AUC-MF can be easily extended to incorporate geographical contextual information for POI recommendation. Specifically, we propose two novel methods to incorporate geographical information in AUC-MF. Different from most existing models where the contextual information is incorporated into the objective function, the incorporation of contextual information in AUC-MF is a refinement of the model and a sampling strategy. The sampling strategy could speed up convergence and the refining of recommendations is independent of training of the model. This mechanism also enables AUC-MF to produce recommendations refined towards different contextual information, with minimum computational cost. Experiments on two datasets show that the proposed AUC-MF outperforms state-of-the-art methods significantly in terms of recommendation accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
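Note on result 23: the abstract treats visited locations as positive examples and unvisited ones as negatives, and optimizes AUC. The sketch below is not the authors' AUC-MF model. It only illustrates the standard pairwise reading of AUC, the fraction of (positive, negative) pairs that a scorer ranks in the right order. The function name `pairwise_auc` and the example scores are hypothetical.

```python
from typing import Sequence


def pairwise_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie: half credit
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted scores for one user (illustrative values only).
visited = [0.9, 0.4]          # POIs the user actually checked in to
unvisited = [0.7, 0.3, 0.1]   # POIs the user never visited
print(pairwise_auc(visited, unvisited))
```

Because the measure depends only on the relative order of positives and negatives, it is not dominated by the large number of unvisited POIs, which fits the abstract's remark about imbalanced data.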
24. Imbalanced text classification: A term weighting approach
- Author
- Liu, Ying, Loh, Han Tong, and Sun, Aixin
- Published
- 2009
- Full Text
- View/download PDF
25. On strategies for imbalanced text classification using SVM: A comparative study
- Author
- Sun, Aixin, Lim, Ee-Peng, and Liu, Ying
- Published
- 2009
- Full Text
- View/download PDF
26. Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework.
- Author
- Zhang, Hao, Sun, Aixin, Jing, Wei, Zhen, Liangli, Zhou, Joey Tianyi, and Goh, Rick Siow Mong
- Subjects
- *NATURAL languages, *QUESTION answering systems, *VIDEO excerpts, *COMPUTER vision, *PROBLEM solving, *VIDEOS
- Abstract
Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address the NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; then, it predicts which clip segment contains the target moment and suppresses the importance of other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
27. Mining latent relations in peer-production environments: a case study with Wikipedia article similarity and controversy
- Author
- Li, Chenliang, Datta, Anwitaman, and Sun, Aixin
- Published
- 2012
- Full Text
- View/download PDF
28. Web data extraction based on structural similarity
- Author
- Li, Zhao, Ng, Wee Keong, and Sun, Aixin
- Published
- 2005
- Full Text
- View/download PDF
29. An energy-efficient and access latency optimized indexing scheme for wireless data broadcast
- Author
- Yao, Yuxia, Tang, Xueyan, Lim, Ee-Peng, and Sun, Aixin
- Subjects
- Mobile communication systems -- Methods, Wireless communication systems -- Methods, Scheduling (Management) -- Methods, Energy conservation -- Methods, Wireless technology, Business, Computers, Electronics, Electronics and electrical industries
- Abstract
Data broadcast is an attractive data dissemination method in mobile environments. To improve energy efficiency, existing air indexing schemes for data broadcast have focused on reducing tuning time only, i.e., the duration that a mobile client stays active in data accesses. On the other hand, existing broadcast scheduling schemes have aimed at reducing access latency through nonflat data broadcast to improve responsiveness only. Not much work has addressed the energy efficiency and responsiveness issues concurrently. This paper proposes an energy-efficient indexing scheme called MHash that optimizes tuning time and access latency in an integrated fashion. MHash reduces tuning time by means of hash-based indexing and enables nonflat data broadcast to reduce access latency. The design of hash function and the optimization of bandwidth allocation are investigated in depth to refine MHash. Experimental results show that, under skewed access distribution, MHash outperforms state-of-the-art air indexing schemes and achieves access latency close to optimal broadcast scheduling. Index Terms--Wireless data broadcast, energy conservation, latency, indexing, scheduling, mobile computing.
- Published
- 2006
30. Blocking reduction strategies in hierarchical text classification
- Author
- Sun, Aixin, Lim, Ee-Peng, Ng, Wee-Keong, and Srivastava, Jaideep
- Subjects
- Data mining, Electronic data processing, Data warehousing/data mining, Business, Computers, Electronics, Electronics and electrical industries
- Abstract
One common approach in hierarchical text classification involves associating classifiers with nodes in the category tree and classifying text documents in a top-down manner. Classification methods using this top-down approach can scale well and cope with changes to the category trees. However, all these methods suffer from blocking which refers to documents wrongly rejected by the classifiers at higher-levels and cannot be passed to the classifiers at lower-levels. In this paper, we propose a classifier-centric performance measure known as blocking factor to determine the extent of the blocking. Three methods are proposed to address the blocking problem, namely, Threshold Reduction, Restricted Voting, and Extended Multiplicative. Our experiments using Support Vector Machine (SVM) classifiers on the Reuters collection have shown that they all could reduce blocking and improve the classification accuracy. Our experiments have also shown that the Restricted Voting method delivered the best performance. Index Terms--Data mining, text mining, classification.
- Published
- 2004
31. A Survey on Deep Learning for Named Entity Recognition.
- Author
- Li, Jing, Sun, Aixin, Han, Jianglei, and Li, Chenliang
- Subjects
- *DEEP learning, *ERGONOMICS, *MACHINE translating, *ENGINEERING design, *NATURAL language processing, *NAMED-entity recognition
- Abstract
Named entity recognition (NER) is the task to identify mentions of rigid designators from text belonging to predefined semantic types such as person, location, organization, etc. NER always serves as the foundation for many natural language applications such as question answering, text summarization, and machine translation. Early NER systems got a huge success in achieving good performance with the cost of human engineering in designing domain-specific features and rules. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
32. Nonenzymatic glucose detection using Au nanodots decorated Cu2O nanooctahedrons.
- Author
- Chen, Dexiang, Xue, Kaifeng, Liu, Huaiqiang, Yao, Binbin, Sun, Aixin, Liu, Chenchen, Zhang, Pinhua, and Cui, Guangliang
- Subjects
- GLUCOSE analysis, BLOOD sugar monitoring, GLUCOSE, BLOOD sugar monitors, SUBSTITUTION reactions
- Abstract
Au nanodots decorated Cu2O nanooctahedrons were fabricated by a facile liquid-phase process combined with a galvanic replacement reaction for nonenzyme glucose detection. A simple rapid test strip based on the nanooctahedrons was proposed to evaluate the possibility of commercial application in nonenzymatic glucose detection. This test strip shows excellent response toward glucose. Linear response was obtained over a concentration ranging from 0.05 mM to 15 mM, and the detection accuracy is 0.05 mM. The good detection performance in selectivity, stability, and feasibility proves the great potential for application in human blood glucose monitoring. This study demonstrated the possibility of a high-performance nonenzyme glucose test strip based on metal-oxide nanostructures decorated by catalysts. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
33. Neural Named Entity Boundary Detection.
- Author
- Li, Jing, Sun, Aixin, and Ma, Yukun
- Subjects
- *RECURRENT neural networks
- Abstract
In this paper, we focus on named entity boundary detection, which is to detect the start and end boundaries of an entity mention in text, without predicting its type. The detected entities are input to entity linking or fine-grained typing systems for semantic enrichment. We propose BdryBot, a recurrent neural network encoder-decoder framework with a pointer network to detect entity boundaries from a given sentence. The encoder considers both character-level representations and word-level embeddings to represent the input words. In this way, BdryBot does not require any hand-crafted features. Because of the pointer network, BdryBot overcomes the problem of variable size output vocabulary and the issue of sparse boundary tags. We conduct two sets of experiments, in-domain detection and cross-domain detection, on six datasets. Our results show that BdryBot achieves state-of-the-art performance against five baselines. In addition, our proposed approach can be further enhanced when incorporating contextualized language embeddings into token representations. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
34. Understanding the stability of medical concept embeddings.
- Author
- Lee, Grace E. and Sun, Aixin
- Subjects
- *INFORMATION science, *SEMANTICS
- Abstract
Frequency is one of the major factors for training quality word embeddings. Several studies have recently discussed the stability of word embeddings in the general domain and suggested factors influencing the stability. In this work, we conduct a detailed analysis on the stability of concept embeddings in the medical domain, particularly in relation to concept frequency. The analysis reveals the surprisingly high stability of low‐frequency concepts: low‐frequency (<100) concepts have the same high stability as high‐frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of high or low frequency. We evaluate the proposed factor by showing the linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent with various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words for the improvement of stability. Finally, we demonstrate that the linear relation of the proposed factor extends to the word embedding stability in the general domain. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
35. Real-time dynamic network learning for location inference modelling and computing.
- Author
- Li, Jianxin, Sun, Aixin, Guan, Ziyu, Cheema, Muhammad Aamir, and Min, Geyong
- Subjects
- *PATTERN recognition systems, *SOCIAL networks, *ARTIFICIAL neural networks
- Abstract
User location information contributes to in-depth social network data analytics. This special issue focuses on emerging techniques and trendy applications of real-time dynamic network learning in fields such as neural networks, dynamic networks, spatial feature pattern recognition, and active learning. The research included in this special issue will also advance location-based services in real applications. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Collective Named Entity Recognition in User Comments via Parameterized Label Propagation.
- Author
- Phan, Minh C. and Sun, Aixin
- Subjects
- *ALGORITHMS, *COMPARATIVE grammar, *INFORMATION science, *INTERNET, *LEARNING strategies, *NATURAL language processing, *SOCIAL networks, *SOCIAL media
- Abstract
Named entity recognition (NER) in the past has focused on extracting mentions in a local region, within a sentence or short paragraph. When dealing with user‐generated text, the diverse and informal writing style makes traditional approaches much less effective. On the other hand, in many types of text on social media such as user comments, tweets, or question–answer posts, the contextual connections between documents do exist. Examples include posts in a thread discussing the same topic and tweets that share a hashtag about the same entity. Our idea in this work is to utilize the related contexts across documents to perform mention recognition in a collective manner. Intuitively, within a mention coreference graph, the labels of mentions are expected to propagate from more confident cases to less confident ones. To this end, we propose a novel semisupervised inference algorithm named parameterized label propagation. In our model, the propagation weights between mentions are learned by an attention‐like mechanism, given their local contexts and the initial labels as input. We study the performance of our approach in the Yahoo! News data set, where comments and articles within a thread share similar context. The results show that our model significantly outperforms all other noncollective NER baselines. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
37. Finding and classifying web units in websites
- Author
- Sun, Aixin and Lim, Ee-Peng
- Subjects
- Business, international
- Abstract
In web classification, most researchers assume that the objects to be classified are individual web pages from one or more websites. In practice, the assumption is too restrictive since a web page itself may not carry sufficient information for it to be treated as an instance of some semantic class or concept. In this paper, we relax this assumption and allow a subgraph of web pages to represent an instance of the semantic concept. Such a subgraph of web pages is known as a web unit. To construct and classify web units, we formulate the web unit mining problem and propose an iterative web unit mining (iWUM) method. The iWUM method first finds subgraphs of web pages using knowledge about website structure and connectivity among the web pages. From these web subgraphs, web units are constructed and classified into categories in an iterative manner. Our experiments using the WebKB dataset showed that iWUM was able to construct web units and classify web units with high accuracy for the more structured parts of a website.
- Published
- 2005
38. Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All.
- Author
- Phan, Minh C., Sun, Aixin, Tay, Yi, Han, Jialong, and Li, Chenliang
- Subjects
- *BIG data, *SEMANTICS, *COMPUTER algorithms, *PAIRED comparisons (Mathematics), *MATHEMATICAL models of decision making
- Abstract
Collective entity disambiguation, or collective entity linking aims to jointly resolve multiple mentions by linking them to their associated entities in a knowledge base. Previous works are primarily based on the underlying assumption that entities within the same document are highly related. However, the extent to which these entities are actually connected in reality is rarely studied and therefore raises interesting research questions. For the first time, this paper shows that the semantic relationships between mentioned entities within a document are in fact less dense than expected. This could be attributed to several reasons such as noise, data sparsity, and knowledge base incompleteness. As a remedy, we introduce MINTREE, a new tree-based objective for the problem of entity disambiguation. The key intuition behind MINTREE is the concept of coherence relaxation which utilizes the weight of a minimum spanning tree to measure the coherence between entities. Based on this new objective, we design Pair-Linking, a novel iterative solution for the MINTREE optimization problem. The idea of Pair-Linking is simple: instead of considering all the given mentions, Pair-Linking iteratively selects a pair with the highest confidence at each step for decision making. Via extensive experiments on eight benchmark datasets, we show that our approach is not only more accurate but also surprisingly faster than many state-of-the-art collective linking algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
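Note on result 38: the abstract describes MINTREE as using the weight of a minimum spanning tree to measure coherence between entities. The sketch below is not the authors' Pair-Linking algorithm. It only shows how a minimum spanning tree weight can be computed with Kruskal's algorithm, a swapped-in textbook method. The entity names, the edge distances, and the function name `mst_weight` are hypothetical.

```python
from typing import Dict, List, Tuple


def mst_weight(nodes: List[str], edges: List[Tuple[float, str, str]]) -> float:
    """Total weight of a minimum spanning tree (Kruskal with union-find)."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    used = 0
    for weight, a, b in sorted(edges):   # cheapest edges first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:             # edge joins two separate components
            parent[root_a] = root_b
            total += weight
            used += 1
    if used != len(nodes) - 1:
        raise ValueError("graph is not connected")
    return total


# Hypothetical candidate entities and pairwise semantic distances
# (illustrative values only; lower distance means more related).
entities = ["A", "B", "C", "D"]
distances = [
    (0.2, "A", "B"),
    (0.9, "A", "C"),
    (0.4, "B", "C"),
    (0.7, "C", "D"),
    (0.8, "B", "D"),
]
print(mst_weight(entities, distances))
```

A tree needs each entity to be close to only one other entity in the set, which matches the abstract's observation that relationships among mentioned entities are sparser than a fully pairwise objective assumes.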
39. Learning-Based Outdoor Localization Exploiting Crowd-Labeled WiFi Hotspots.
- Author
- Wang, Jin, Luo, Jun, Pan, Sinno Jialin, and Sun, Aixin
- Subjects
- WIRELESS hotspots, METROPOLITAN areas, GLOBAL Positioning System, CITIES & towns, HUMAN fingerprints
- Abstract
The ever-expanding scale of WiFi deployments in metropolitan areas has made accurate GPS-free outdoor localization possible by relying solely on the WiFi infrastructure. Nevertheless, neither academic researches nor existing industrial practices seem to provide a satisfactory solution or implementation. In this paper, we propose WOLoc (WiFi-only Outdoor Localization) as a learning-based outdoor localization solution using only WiFi hotspots labeled by crowdsensing. On one hand, we do not take these labels as fingerprints as it is almost impossible to extend indoor localization mechanisms by fingerprinting metropolitan areas. On the other hand, we avoid the over-simplified local synthesis methods (e.g., centroid) that significantly lose the information contained in the labels. Instead, WOLoc adopts a semi-supervised manifold learning approach that accommodates all the labeled and unlabeled data for a given area, and the output concerning the unlabeled part will become the estimated locations for both unknown users and unknown WiFi hotspots. Moreover, WOLoc applies text mining techniques to analyze the SSIDs of hotspots, so as to derive more accurate input to its manifold learning. We conduct extensive experiments in several outdoor areas, and the results have strongly indicated the efficacy of our solution in achieving a meter-level localization accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
40. Collecting event‐related tweets from twitter stream.
- Author
-
Zheng, Xin and Sun, Aixin
- Subjects
*COMMUNICATION, *INFORMATION retrieval, *NATURAL disasters, *PUBLIC opinion, *INFORMATION resources, *SOCIAL media, *HUMAN services programs, *ACQUISITION of data
- Abstract
Twitter provides a channel of collecting and publishing instant information on major events like natural disasters. However, information flow on Twitter is of great volume. For a specific event, messages collected from the Twitter Stream based on either location constraint or predefined keywords would contain a lot of noise. In this article, we propose a method to achieve both high‐precision and high‐recall in collecting event‐related tweets. Our method involves an automatic keyword generation component, and an event‐related tweet identification component. For keyword generation, we consider three properties of candidate keywords, namely relevance, coverage, and evolvement. The keyword updating mechanism enables our method to track the main topics of tweets along event development. To minimize annotation effort in identifying event‐related tweets, we adopt active learning and incorporate multiple‐instance learning which assigns labels to bags instead of instances (that is, individual tweets). Through experiments on two real‐world events, we demonstrate the superiority of our method against state‐of‐the‐art alternatives. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
41. To Do or Not To Do: Distill crowdsourced negative caveats to augment API documentation.
- Author
-
Li, Jing, Sun, Aixin, and Xing, Zhenchang
- Subjects
*COMPUTER science, *COMPUTER software, *INFORMATION science, *INFORMATION technology, *NATURAL language processing, *PROGRAMMING languages, *SEMANTICS, *DATA mining, *QUALITATIVE research, *QUANTITATIVE research, *CROWDSOURCING
- Abstract
Negative caveats of application programming interfaces (APIs) are about "how not to use an API," which are often absent from the official API documentation. When these caveats are overlooked, programming errors may emerge from misusing APIs, leading to heavy discussions on Q&A websites like Stack Overflow. If the overlooked caveats could be mined from these discussions, they would be beneficial for programmers to avoid misuse of APIs. However, it is challenging because the discussions are informal, redundant, and diverse. To this end, we propose Disca, a novel approach for automatically Distilling desirable API negative caveats from unstructured Q&A discussions. Through sentence selection and prominent term clustering, Disca ensures that distilled caveats are context‐independent, prominent, semantically diverse, and nonredundant. Quantitative evaluation in our experiments shows that the proposed Disca significantly outperforms four text‐summarization techniques. We also show that the distilled API negative caveats could greatly augment API documentation through qualitative analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
42. A Survey of Location Prediction on Twitter.
- Author
-
Zheng, Xin, Han, Jialong, and Sun, Aixin
- Subjects
LOCATION analysis, AUTOMATIC identification, SOCIAL network analysis, MICROBLOGS -- Social aspects
- Abstract
Locations, e.g., countries, states, cities, and points-of-interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the world-wide coverage of its users and real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts are spent on dealing with new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim at offering an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we also briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we conclude the survey and list future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
43. Linking Fine-Grained Locations in User Comments.
- Author
-
Han, Jialong, Sun, Aixin, Cong, Gao, Zhao, Wayne Xin, Ji, Zongcheng, and Phan, Minh C.
- Subjects
*WEBSITES, *MICROBLOGS, *CROWDSOURCING, *SOCIAL networks, *INFORMATION sharing
- Abstract
Many domain-specific websites host a profile page for each entity (e.g., locations on Foursquare, movies on IMDb, and products on Amazon) for users to post comments on. When commenting on an entity, users often mention other entities for reference or comparison. Compared with web pages and tweets, the problem of disambiguating the mentioned entities in user comments has not received much attention. This paper investigates linking fine-grained locations in Foursquare comments. We demonstrate that the focal location, i.e., the location that a comment is posted on, provides rich contexts for the linking task. To exploit such information, we represent the Foursquare data in a graph, which includes locations, comments, and their relations. A probabilistic model named FocalLink is proposed to estimate the probability that a user mentions a location when commenting on a focal location, by following different kinds of relations. Experimental results show that FocalLink is consistently superior under different collective linking settings. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
44. Exploring prestigious citations sourced from top universities in bibliometrics and altmetrics: a case study in the computer science discipline.
- Author
-
Luo, Feiheng, Sun, Aixin, Erdt, Mojisola, Sesagiri Raamkumar, Aravind, and Theng, Yin-Leng
- Abstract
Citation count is an important indicator for measuring research outputs. There have been numerous studies that have investigated factors affecting citation counts from the perspectives of cited papers and citing papers. In this paper, we focused specifically on citing papers and explored citations sourced from prestigious affiliations in the computer science discipline. The QS World University Rankings was employed to identify prestigious citations, named QS citations. We used the Microsoft Academic Graph, a massive scholarly dataset, and conducted different kinds of analysis between papers with QS citations and those without QS citations. We discovered that papers with QS citations are generally associated with higher total citation counts than those without QS citations. We extended the analysis to authors and journals, and the results indicated that when authors or journals have higher proportions of papers with QS citations, they are usually associated with higher values of the H-index or the Journal Impact Factor respectively. Additionally, papers with QS citations are also associated with a higher Altmetric Attention Score and a higher number of specific types of altmetrics such as tweet counts. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
45. An analysis of 14 Million tweets on hashtag-oriented spamming*.
- Author
-
Sedhai, Surendra and Sun, Aixin
- Subjects
*CONTENT analysis, *DECEPTION, *METADATA, *SOCIAL media, *CROWDSOURCING
- Abstract
Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, the popularity of Twitter has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., nonspam) labels, to understand spamming activities on Twitter. The primary focus of this paper is to analyze various aspects of spam on Twitter based on hashtags, tweet contents, and user profiles, which are useful for both tweet-level and user-level spam detection. First, we compare the usage of hashtags in spam and ham tweets based on frequency, position, orthography, and co-occurrence. Second, for content-based analysis, we analyze the variations in word usage, metadata, and near-duplicate tweets. Third, for user-based analysis, we investigate user profile information. In our study, we validate that spammers use popular hashtags to promote their tweets. We also observe differences in the usage of words in spam and ham tweets. Spam tweets are more likely to be emphasized using exclamation points and capitalized words. Furthermore, we observe that spammers use multiple accounts to post near-duplicate tweets to promote their services and products. Unlike spammers, legitimate users are likely to provide more information such as their locations and personal descriptions in their profiles. In summary, this study presents a comprehensive analysis of hashtags, tweet contents, and user profiles in Twitter spamming. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
46. Extracting fine-grained location with temporal awareness in tweets: A two-stage approach.
- Author
-
Li, Chenliang and Sun, Aixin
- Subjects
*LINGUISTICS, *RESEARCH funding, *SOCIAL skills, *TIME, *SOCIAL media
- Abstract
Twitter has attracted billions of users for life logging and sharing activities and opinions. In their tweets, users often reveal their location information and short-term visiting histories or plans. Capturing user's short-term activities could benefit many applications for providing the right context at the right time and location. In this paper we are interested in extracting locations mentioned in tweets at fine-grained granularity, with temporal awareness. Specifically, we recognize the points-of-interest (POIs) mentioned in a tweet and predict whether the user has visited, is currently at, or will soon visit the mentioned POIs. A POI can be a restaurant, a shopping mall, a bookstore, or any other fine-grained location. Our proposed framework, named TS-Petar (Two-Stage POI Extractor with Temporal Awareness), consists of two main components: a POI inventory and a two-stage time-aware POI tagger. The POI inventory is built by exploiting the crowd wisdom of the Foursquare community. It contains both POIs' formal names and their informal abbreviations, commonly observed in Foursquare check-ins. The time-aware POI tagger, based on the Conditional Random Field (CRF) model, is devised to disambiguate the POI mentions and to resolve their associated temporal awareness accordingly. Three sets of contextual features (linguistic, temporal, and inventory features) and two labeling schema features (OP and BILOU schemas) are explored for the time-aware POI extraction task. Our empirical study shows that the subtask of POI disambiguation and the subtask of temporal awareness resolution call for different feature settings for best performance. We have also evaluated the proposed TS-Petar against several strong baseline methods. The experimental results demonstrate that the two-stage approach achieves the best accuracy and outperforms all baseline methods in terms of both effectiveness and efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
47. Tweet Segmentation and Its Application to Named Entity Recognition.
- Author
-
Li, Chenliang, Sun, Aixin, Weng, Jianshu, and He, Qi
- Subjects
*ENTITY-relationship modeling, *MICROBLOGS, *INFORMATION retrieval, *NATURAL language processing, *LINGUISTIC analysis
- Abstract
Twitter has attracted millions of users to share and disseminate the most up-to-date information, resulting in large volumes of data produced every day. However, many applications in Information Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the noisy and short nature of tweets. In this paper, we propose a novel framework for tweet segmentation in a batch mode, called HybridSeg. By splitting tweets into meaningful segments, the semantic or context information is well preserved and easily extracted by the downstream applications. HybridSeg finds the optimal segmentation of a tweet by maximizing the sum of the stickiness scores of its candidate segments. The stickiness score considers the probability of a segment being a phrase in English (i.e., global context) and the probability of a segment being a phrase within the batch of tweets (i.e., local context). For the latter, we propose and evaluate two models to derive local context by considering the linguistic features and term-dependency in a batch of tweets, respectively. HybridSeg is also designed to iteratively learn from confident segments as pseudo feedback. Experiments on two tweet data sets show that tweet segmentation quality is significantly improved by learning both global and local contexts compared with using global context alone. Through analysis and comparison, we show that local linguistic features are more reliable for learning local context compared with term-dependency. As an application, we show that high accuracy is achieved in named entity recognition by applying segment-based part-of-speech (POS) tagging. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
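The objective stated in the HybridSeg abstract above (choose the segmentation that maximizes the sum of per-segment stickiness scores) can be solved with a simple dynamic program. The sketch below is illustrative only and is not the published HybridSeg code: `best_segmentation`, the `max_len` cap, and the toy score table are assumptions, and the real stickiness function combines global and local context rather than a lookup table.

```python
def best_segmentation(words, stickiness, max_len=3):
    """Split `words` into consecutive segments maximizing the sum of
    stickiness(segment). Segments span at most `max_len` words."""
    n = len(words)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for words[:i]
    back = [0] * (n + 1)              # back[i]: start of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = " ".join(words[start:end])
            score = best[start] + stickiness(segment)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers to recover the chosen segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(" ".join(words[start:end]))
        end = start
    return segments[::-1], best[n]


# Hypothetical stickiness scores; unknown phrases score 0, single words 0.3.
toy_scores = {"new york": 2.0, "times square": 1.5, "new york times": 1.8}


def toy_stickiness(segment):
    return toy_scores.get(segment, 0.0 if " " in segment else 0.3)


segments, total = best_segmentation(
    "i love new york times square".split(), toy_stickiness
)
```

With these toy scores the program prefers "new york" followed by "times square" over "new york times" followed by "square", because the former pair has the higher total.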
48. On predicting the popularity of newly emerging hashtags in Twitter.
- Author
-
Ma, Zongyang, Sun, Aixin, and Cong, Gao
- Subjects
*CLASSIFICATION, *EXPERIMENTAL design, *INTERNET, *PRESS, *PUBLIC opinion, *SOCIAL skills, *BLOGS, *ACCESS to information, *SOCIAL media
- Abstract
Because of Twitter's popularity and the viral nature of information dissemination on Twitter, predicting which Twitter topics will become popular in the near future becomes a task of considerable economic importance. Many Twitter topics are annotated by hashtags. In this article, we propose methods to predict the popularity of new hashtags on Twitter by formulating the problem as a classification task. We use five standard classification models (i.e., naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression) for prediction. The main challenge is the identification of effective features for describing new hashtags. We extract 7 content features from a hashtag string and the collection of tweets containing the hashtag and 11 contextual features from the social graph formed by users who have adopted the hashtag. We conducted experiments on a Twitter data set consisting of 31 million tweets from 2 million Singapore-based users. The experimental results show that the standard classifiers using the extracted features significantly outperform the baseline methods that do not use these features. Among the five classifiers, the logistic regression model performs the best in terms of the Micro-F1 measure. We also observe that contextual features are more effective than content features. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
49. TSDW: Two-stage word sense disambiguation using Wikipedia.
- Author
-
Li, Chenliang, Sun, Aixin, and Datta, Anwitaman
- Subjects
*ALGORITHMS, *INTERNET, *LANGUAGE & languages, *SEMANTICS, *REFERENCE sources, *ACCESS to information
- Abstract
The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
50. An evaluation of classification models for question topic categorization.
- Author
-
Qu, Bo, Cong, Gao, Li, Cuiping, Sun, Aixin, and Chen, Hong
- Subjects
CLASSIFICATION, INFORMATION retrieval, INFORMATION services, INTERNET, ONLINE information services
- Abstract
We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions and these questions are organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of the performance of different classification methods on question topic classification as well as short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-word features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of the state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show what aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
Discovery Service for Jio Institute Digital Library