Author: "Qiaozhu Mei" / Search Limiters: Full Text - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Qiaozhu Mei"' showing total 83 results

1. PM2.5 forecasting under distribution shift: A graph learning approach

Author: Yachuan Liu, Jiaqi Ma, Paramveer Dhillon, and Qiaozhu Mei
Subjects: Spatial–temporal graph neural networks, PM2.5 forecasting, Distribution shift, Electronic computers. Computer science, QA75.5-76.95
Abstract: We present a new benchmark task for graph-based machine learning, aiming to predict future air quality (PM2.5 concentration) observed by a geographically distributed network of environmental sensors. While prior work has successfully applied Graph Neural Networks (GNNs) on a wide family of spatio-temporal prediction tasks, the new benchmark task introduced here brings a technical challenge that has been less studied in the context of graph-based spatio-temporal learning: distribution shift across a long period of time. An important goal of this paper is to understand the behavior of spatio-temporal GNNs under distribution shift. We conduct a comprehensive comparative study of both graph-based and non-graph-based machine learning models under two data split methods, one that results in distribution shift and one that does not. Our empirical results suggest that GNN models tend to suffer more from distribution shift compared to non-graph-based models, which calls for special attention when deploying spatio-temporal GNNs in practice.
Published: 2024
Full Text: View/download PDF

2. Developing a Semantically Based Query Recommendation for an Electronic Medical Record Search Engine: Query Log Analysis and Design Implications

Author: Danny T Y Wu, David Hanauer, Paul Murdock, V G Vinod Vydiswaran, Qiaozhu Mei, and Kai Zheng
Subjects: Medicine
Abstract: Background: An effective and scalable information retrieval (IR) system plays a crucial role in enabling clinicians and researchers to harness the valuable information present in electronic health records. In a previous study, we developed a prototype medical IR system, which incorporated a semantically based query recommendation (SBQR) feature. The system was evaluated empirically and demonstrated high perceived performance by end users. To delve deeper into the factors contributing to this perceived performance, we conducted a follow-up study using query log analysis. Objective: One of the primary challenges faced in IR is that users often have limited knowledge regarding their specific information needs. Consequently, an IR system, particularly its user interface, needs to be thoughtfully designed to assist users through the iterative process of refining their queries as they encounter relevant documents during their search. To address these challenges, we incorporated “query recommendation” into our Electronic Medical Record Search Engine (EMERSE), drawing inspiration from the success of similar features in modern IR systems for general purposes. Methods: The query log data analyzed in this study were collected during our previous experimental study, where we developed EMERSE with the SBQR feature. We implemented a logging mechanism to capture user query behaviors and the output of the IR system (retrieved documents). In this analysis, we compared the initial query entered by users with the query formulated with the assistance of the SBQR. By examining the results of this comparison, we could examine whether the use of SBQR helped in constructing improved queries that differed from the original ones. Results: Our findings revealed that the first query entered without SBQR and the final query with SBQR assistance were highly similar (Jaccard similarity coefficient=0.77). This suggests that the perceived positive performance of the system was primarily attributed to the automatic query expansion facilitated by the SBQR rather than users manually manipulating their queries. In addition, through entropy analysis, we observed that search results converged in scenarios of moderate difficulty, and the degree of convergence correlated strongly with the perceived system performance. Conclusions: The study demonstrated the potential contribution of the SBQR in shaping participants' positive perceptions of system performance, contingent upon the difficulty of the search scenario. Medical IR systems should therefore consider incorporating an SBQR as a user-controlled option or a semiautomated feature. Future work entails redesigning the experiment in a more controlled manner and conducting multisite studies to demonstrate the effectiveness of EMERSE with SBQR for patient cohort identification. By further exploring and validating these findings, we can enhance the usability and functionality of medical IR systems in real-world settings.
Published: 2023
Full Text: View/download PDF
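
Note on record 2: the abstract reports a Jaccard similarity coefficient between each user's first query and final query. As a rough illustration of that measure only, the sketch below treats each query as a set of lowercase, whitespace-separated terms. The tokenization, the function name `jaccard`, and the example queries are assumptions made for illustration; the listing does not describe the study's actual preprocessing.

```python
def jaccard(query_a: str, query_b: str) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the term sets of two queries."""
    # Assumption: a query is tokenized by lowercasing and splitting on whitespace.
    terms_a = set(query_a.lower().split())
    terms_b = set(query_b.lower().split())
    if not terms_a and not terms_b:
        # Two empty queries are treated as identical by convention.
        return 1.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Hypothetical queries: the final query keeps every original term and adds one.
first_query = "chest pain shortness of breath"
final_query = "chest pain shortness of breath dyspnea"
similarity = jaccard(first_query, final_query)
print(f"Jaccard similarity: {similarity:.2f}")
```

A value near 1 means the two queries share almost all of their terms, which is how a coefficient such as 0.77 can be read as "highly similar".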

3. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification

Author: Xuedong Li, Walter Yuan, Dezhong Peng, Qiaozhu Mei, and Yue Wang
Subjects: Learning curve, Bidirectional encoder representations from transformers, Disease classification, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Background: Natural language processing (NLP) tasks in the health domain often deal with limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to a task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have been proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at such a large scale that is unlikely to obtain in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the amount of training examples per disease from small to large, and measured the classification performance in macro-averaged $F_{1}$ score. Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion: As long as the amount of training documents is not extremely few, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features shall be considered.
Published: 2022
Full Text: View/download PDF
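
Note on record 3: the abstract measures performance with a macro-averaged F1 score, which averages per-class F1 without weighting by class size, so rare diseases count as much as common ones. The sketch below is a minimal pure-Python illustration of that metric, not the authors' evaluation code; the function name `macro_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: a classifier that always predicts the common class.
y_true = ["flu", "flu", "flu", "rare"]
y_pred = ["flu", "flu", "flu", "flu"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro-F1={score:.2f}")
```

In this toy case accuracy looks acceptable while macro-F1 is much lower, because the missed rare class contributes an F1 of zero to the average.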

4. Emojis predict dropouts of remote workers: An empirical study of emoji usage on GitHub

Author: Xuan Lu, Wei Ai, Zhenpeng Chen, Yanbin Cao, and Qiaozhu Mei
Subjects: Medicine, Science
Abstract: Emotions at work have long been identified as critical signals of work motivations, status, and attitudes, and as predictors of various work-related outcomes. When more and more employees work remotely, these emotional signals of workers become harder to observe through daily, face-to-face communications. The use of online platforms to communicate and collaborate at work provides an alternative channel to monitor the emotions of workers. This paper studies how emojis, as non-verbal cues in online communications, can be used for such purposes and how the emotional signals in emoji usage can be used to predict future behavior of workers. In particular, we present how the developers on GitHub use emojis in their work-related activities. We show that developers have diverse patterns of emoji usage, which can be related to their working status including activity levels, types of work, types of communications, time management, and other behavioral patterns. Developers who use emojis in their posts are significantly less likely to drop out of the online work platform. Surprisingly, solely using emoji usage as features, standard machine learning models can predict future dropouts of developers with satisfactory accuracy. Features related to the general use and the emotions of emojis appear to be important factors, while they do not rule out paths through other purposes of emoji use.
Published: 2022

5. Improving rare disease classification using imperfect knowledge graph

Author: Xuedong Li, Yue Wang, Dongwu Wang, Walter Yuan, Dezhong Peng, and Qiaozhu Mei
Subjects: Rare disease diagnosis, Knowledge graph, Machine learning, Text classification, Extremely imbalanced data, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Background: Accurately recognizing rare diseases based on symptom description is an important task in patient triage, early risk stratification, and target therapies. However, due to the very nature of rare diseases, the lack of historical data poses a great challenge to machine learning-based approaches. On the other hand, medical knowledge in automatically constructed knowledge graphs (KGs) has the potential to compensate for the lack of labeled training examples. This work aims to develop a rare disease classification algorithm that makes effective use of a knowledge graph, even when the graph is imperfect. Method: We develop a text classification algorithm that represents a document as a combination of a “bag of words” and a “bag of knowledge terms,” where a “knowledge term” is a term shared between the document and the subgraph of KG relevant to the disease classification task. We use two Chinese disease diagnosis corpora to evaluate the algorithm. The first one, HaoDaiFu, contains 51,374 chief complaints categorized into 805 diseases. The second data set, ChinaRe, contains 86,663 patient descriptions categorized into 44 disease categories. Results: On the two evaluation data sets, the proposed algorithm delivers robust performance and outperforms a wide range of baselines, including resampling, deep learning, and feature selection approaches. Both a classification-based metric (macro-averaged $F_{1}$ score) and a ranking-based metric (mean reciprocal rank) are used in evaluation. Conclusion: Medical knowledge in large-scale knowledge graphs can be effectively leveraged to improve rare disease classification models, even when the knowledge graph is incomplete.
Published: 2019
Full Text: View/download PDF
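
Note on record 5: the abstract names mean reciprocal rank (MRR) as its ranking-based metric. For each document, the reciprocal rank is 1 divided by the position of the correct disease in the model's ranked list, and MRR is the mean over documents. The sketch below illustrates the metric only; the function name, the convention of scoring a missing label as 0, and the toy rankings are assumptions, not details from the paper.

```python
def mean_reciprocal_rank(rankings, gold_labels):
    """Mean of 1/rank of the gold label in each ranked candidate list."""
    if not gold_labels or len(rankings) != len(gold_labels):
        raise ValueError("rankings and gold_labels must be non-empty and aligned")
    total = 0.0
    for ranked, gold in zip(rankings, gold_labels):
        if gold in ranked:
            # Ranks are 1-based: the top candidate has rank 1.
            total += 1.0 / (ranked.index(gold) + 1)
        # Assumption: a gold label absent from the list contributes 0.
    return total / len(gold_labels)


# Toy rankings of three hypothetical disease labels for three documents.
rankings = [
    ["A", "B", "C"],  # gold "A" at rank 1 -> 1
    ["B", "A", "C"],  # gold "A" at rank 2 -> 1/2
    ["C", "B", "A"],  # gold "A" at rank 3 -> 1/3
]
gold_labels = ["A", "A", "A"]
mrr = mean_reciprocal_rank(rankings, gold_labels)
print(f"MRR = {mrr:.3f}")
```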

6. Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification

Author: David A. Hanauer, Qiaozhu Mei, V. G. Vinod Vydiswaran, Karandeep Singh, Zach Landis-Lewis, and Chunhua Weng
Subjects: Lexical variation, Natural language processing, Information retrieval, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Background: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. Methods: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. Results: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. Conclusions: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.
Published: 2019
Full Text: View/download PDF

7. An active learning-enabled annotation system for clinical named entity recognition

Author: Yukun Chen, Thomas A. Lasko, Qiaozhu Mei, Qingxia Chen, Sungrim Moon, Jingqi Wang, Ky Nguyen, Tolulola Dawodu, Trevor Cohen, Joshua C. Denny, and Hua Xu
Subjects: Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Background: Active learning (AL) has shown the promising potential to minimize the annotation cost while maximizing the performance in building statistical natural language processing (NLP) models. However, very few studies have investigated AL in a real-life setting in medical domain. Methods: In this study, we developed the first AL-enabled annotation system for clinical named entity recognition (NER) with a novel AL algorithm. Besides the simulation study to evaluate the novel AL algorithm, we further conducted user studies with two nurses using this system to assess the performance of AL in real world annotation processes for building clinical NER models. Results: The simulation results show that the novel AL algorithm outperformed traditional AL algorithm and random sampling. However, the user study tells a different story that AL methods did not always perform better than random sampling for different users. Conclusions: We found that the increased information content of actively selected sentences is strongly offset by the increased time required to annotate them. Moreover, the annotation time was not considered in the querying algorithms. Our future work includes developing better AL algorithms with the estimation of annotation time and evaluating the system with larger number of users.
Published: 2017
Full Text: View/download PDF

8. A Turing test of whether AI chatbots are behaviorally similar to humans.

Author: Qiaozhu Mei, Yutong Xie, Walter Yuan, and Matthew O. Jackson
Subjects: Turing test, Chatbots, Behavior modification, Artificial intelligence, ChatGPT, Personality
Abstract: We administer a Turing test to AI chatbots. We examine how chatbots behave in a suite of classic behavioral games that are designed to elicit characteristics such as trust, fairness, risk-aversion, cooperation, etc., as well as how they respond to a traditional Big-5 psychological survey that measures personality traits. ChatGPT-4 exhibits behavioral and personality traits that are statistically indistinguishable from a random human from tens of thousands of human subjects from more than 50 countries. Chatbots also modify their behavior based on previous experience and contexts "as if" they were learning from the interactions and change their behavior in response to different framings of the same strategic situation. Their behaviors are often distinct from average and modal human behaviors, in which case they tend to behave on the more altruistic and cooperative end of the distribution. We estimate that they act as if they are maximizing an average of their own and partner's payoffs.
Published: 2024
Full Text: View/download PDF

9. A Prompt Log Analysis of Text-to-Image Generation Systems

Author: Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, and Qiaozhu Mei
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Human-Computer Interaction, Computer Science - Computer Vision and Pattern Recognition, Information Retrieval (cs.IR), Human-Computer Interaction (cs.HC), Computer Science - Information Retrieval
Abstract: Recent developments in large language models (LLM) and generative AI have unleashed the astonishing capabilities of text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a "prompt". These systems have immediately received lots of attention from researchers, creators, and common users. Despite the plenty of efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the glory of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer, often organized into special structures that consist of the subject, form, and intent of the generation tasks and present unique categories of information needs. Users make more edits within creation sessions, which present remarkable exploratory patterns. There is also a considerable gap between the user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications on how to improve text-to-image generation systems for creation purposes.
Published: 2023
Full Text: View/download PDF

10. Putting Teams into the Gig Economy: A Field Experiment at a Ride-Sharing Platform

Author: Wei Ai, Yan Chen, Qiaozhu Mei, Jieping Ye, and Lingyu Zhang
Subjects: Strategy and Management, Management Science and Operations Research
Abstract: The gig economy provides workers with the benefits of autonomy and flexibility but at the expense of work identity and coworker bonds. Among the many reasons why gig workers leave their platforms, one unexplored aspect is the lack of an organization identity. In this study, we develop a team formation and interteam contest field experiment at a ride-sharing platform. We assign drivers to teams either randomly or based on similarity in age, hometown location, or productivity. Having these teams compete for cash prizes, we find that (1) compared with those in the control condition, treated drivers work longer hours and earn 12% higher revenue during the contest; (2) the treatment effect persists two weeks postcontest, albeit with half of the effect size; and (3) drivers in hometown-similar teams are more likely to communicate with each other, whereas those in age-similar teams continue to work longer hours and earn higher revenue during the two weeks after the contest ends. Together, our results show that platform designers can leverage team identity and team contests to increase revenue and worker engagement in a gig economy. This paper was accepted by David Simchi-Levi, behavioral economics and decision analysis. Funding: Financial support from the platform through the Michigan Institute for Data Science is gratefully acknowledged. Supplemental Material: The e-companion and data are available at https://doi.org/10.1287/mnsc.2022.4624.
Published: 2023
Full Text: View/download PDF

11. Virtual teams in a gig economy

Author: Teng Ye, Wei Ai, Yan Chen, Qiaozhu Mei, Jieping Ye, and Lingyu Zhang
Subjects: Multidisciplinary
Abstract: While the gig economy provides flexible jobs for millions of workers globally, a lack of organization identity and coworker bonds contributes to their low engagement and high attrition rates. To test the impact of virtual teams on worker productivity and retention, we conduct a field experiment with 27,790 drivers on a ride-sharing platform. We organize drivers into teams that are randomly assigned to receiving their team ranking, or individual ranking within their team, or individual performance information (control). We find that treated drivers work longer hours and generate significantly higher revenue. Furthermore, drivers in the team-ranking treatment continue to be more engaged 3 months after the end of the experiment. A machine-learning analysis of 149 team contests in 86 cities suggests that social comparison, driver experience, and within-team similarity are the key predictors of the virtual team efficacy.
Published: 2022
Full Text: View/download PDF

12. Classifying the Political Leaning of News Articles and Users from User Votes

Author: Daniel Xiaodan Zhou, Paul Resnick, and Qiaozhu Mei
Abstract: Social news aggregator services generate readers’ subjective reactions to news opinion articles. Can we use those as a resource to classify articles as liberal or conservative, even without knowing the self-identified political leaning of most users? We applied three semi-supervised learning methods that propagate classifications of political news articles and users as conservative or liberal, based on the assumption that liberal users will vote for liberal articles more often, and similarly for conservative users and articles. Starting from a few labeled articles and users, the algorithms propagate political leaning labels to the entire graph. In cross-validation, the best algorithm achieved 99.6% accuracy on held-out users and 96.3% accuracy on held-out articles. Adding social data such as users’ friendship or text features such as cosine similarity did not improve accuracy. The propagation algorithms, using the subjective liking data from users, also performed better than an SVM based text classifier, which achieved 92.0% accuracy on articles.
Published: 2021
Full Text: View/download PDF

13. Audience Analysis for Competing Memes in Social Media

Author: Samuel Carton, Souneil Park, Nicole Zeffer, Eytan Adar, Qiaozhu Mei, and Paul Resnick
Abstract: Existing tools for exploratory analysis of information diffusion in social media focus on the message senders who actively diffuse the meme. We develop a tool for audience analysis, focusing on the people who are passively exposed to the messages, with a special emphasis on competing memes such as propagations and corrections of a rumor. In such competing meme diffusions, important questions include which meme reached a bigger total audience, the overlap in audiences of the two, and whether exposure to one meme inhibited propagation of the other. We track audience members’ states of interaction, such as having been exposed to one meme or another or both. We analyze the marginal impact of each message in terms of the number of people who transition between states as a result of that message. These marginal impacts can be computed efficiently, even for diffusions involving thousands of senders and millions of receivers. The marginal impacts provide the raw material for an interactive tool, RumorLens, that includes a Sankey diagram and a network diagram. We validate the utility of the tool through a case study of nine rumor diffusions. We validate the usability of the tool through a user study, showing that nonexperts are able to use it to answer audience analysis questions.
Published: 2021
Full Text: View/download PDF

14. Unexpected Relevance: An Empirical Study of Serendipity in Retweets

Author: Tao Sun, Ming Zhang, and Qiaozhu Mei
Abstract: Serendipity is a beneficial discovery that happens in an unexpected way. It has been found spectacularly valuable in various contexts, including scientific discoveries, acquisition of business, and recommender systems. Although never formally proved with large-scale behavioral analysis, it is believed by scientists and practitioners that serendipity is an important factor of positive user experience and increased user engagement. In this paper, we take the initiative to study the ubiquitous occurrence of serendipitous information diffusion and its effect in the context of microblogging communities. We refer to serendipity as unexpected relevance, then propose a principled statistical method to test the unexpectedness and the relevance of information received by a microblogging user, which identifies a serendipitous diffusion of information to the user. Our findings based on large-scale behavioral analysis reveal that there is a surprisingly strong presence of serendipitous information diffusion in retweeting, which accounts for more than 25% of retweets in both Twitter and Weibo. Upon the identification of serendipity, we are able to conduct observational analysis that reveals the benefit of serendipity to microblogging users. Results show that both the discovery and provision of serendipity increase the level of user activities and social interactions, while the provision of serendipitous information also increases the influence of Twitter users.
Published: 2021
Full Text: View/download PDF

15. Feature-Based Explanations Don't Help People Detect Misclassifications of Online Toxicity

Author: Samuel Carton, Qiaozhu Mei, and Paul Resnick
Abstract: We present an experimental assessment of the impact of feature attribution-style explanations on human performance in predicting the consensus toxicity of social media posts with advice from an unreliable machine learning model. By doing so we add to a small but growing body of literature inspecting the utility of interpretable machine learning in terms of human outcomes. We also evaluate interpretable machine learning for the first time in the important domain of online toxicity, where fully-automated methods have faced criticism as being inadequate as a measure of toxic behavior. We find that, contrary to expectations, explanations have no significant impact on accuracy or agreement with model predictions, though they do change the distribution of subject error somewhat while reducing the cognitive burden of the task for subjects. Our results contribute to the recognition of an intriguing expectation gap in the field of interpretable machine learning between the general excitement the field has engendered and the ambiguous results of recent experimental work, including this study.
Published: 2020
Full Text: View/download PDF

16. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification

Author: Walter Yuan, Yue Wang, Xuedong Li, Dezhong Peng, and Qiaozhu Mei
Subjects: Computer science, Health Policy, education, Context (language use), Health Informatics, Machine learning, Domain (software engineering), Task (project management), Computer Science Applications, Learning curve, Humans, Artificial intelligence, Language model, Baseline (configuration management), Encoder, Language, Natural Language Processing, Transformer (machine learning model)
Abstract: Background: Natural language processing (NLP) tasks in the health domain often deal with limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to a task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have been proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at such a large scale that is unlikely to obtain in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the amount of training examples per disease from small to large, and measured the classification performance in macro-averaged $F_{1}$ score. Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion: As long as the amount of training documents is not extremely few, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features shall be considered.
Published: 2021
Full Text: View/download PDF

17. Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification

Author: Samuel Carton, Qiaozhu Mei, Mengtian Guo, and Cristina Garbacea
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Theoretical computer science, Artificial neural network, Language complexity, Computer Science - Artificial Intelligence, Text simplification, Process (engineering), Computer science, Pipeline (computing), Deep learning, Transparency (human–computer interaction), Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Margin (machine learning), Artificial intelligence, Computation and Language (cs.CL)
Abstract: Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a blackbox. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected: 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. Simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box simplification models can be improved by a large margin. ACL 2021.
Published: 2021
Full Text: View/download PDF

18. Predicting Individual Treatment Effects of Large-scale Team Competitions in a Ride-sharing Economy

Author: Teng Ye, Ning Luo, Qiaozhu Mei, Zhang Lulu, Wei Ai, Jieping Ye, and Lingyu Zhang
Subjects: Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Average treatment effect, Computer science, Flexibility (personality), Machine Learning (stat.ML), Computer Science - Social and Information Networks, 02 engineering and technology, CONTEST, Machine Learning (cs.LG), Computer Science - Computers and Society, Sharing economy, Statistics - Machine Learning, 020204 information systems, Scale (social sciences), Computers and Society (cs.CY), 0202 electrical engineering, electronic engineering, information engineering, Revenue, 020201 artificial intelligence & image processing, Job satisfaction, Marketing, Social identity theory
Abstract: Millions of drivers worldwide have enjoyed financial benefits and work schedule flexibility through a ride-sharing economy, but meanwhile they have suffered from the lack of a sense of identity and career achievement. Equipped with social identity and contest theories, financially incentivized team competitions have been an effective instrument to increase drivers' productivity, job satisfaction, and retention, and to improve revenue over cost for ride-sharing platforms. While these competitions are overall effective, the decisive factors behind the treatment effects and how they affect the outcomes of individual drivers have been largely mysterious. In this study, we analyze data collected from more than 500 large-scale team competitions organized by a leading ride-sharing platform, building machine learning models to predict individual treatment effects. Through a careful investigation of features and predictors, we are able to reduce out-of-sample prediction error by more than 24%. Through interpreting the best-performing models, we discover many novel and actionable insights regarding how to optimize the design and the execution of team competitions on ride-sharing platforms. A simulated analysis demonstrates that by simply changing a few contest design options, the average treatment effect of a real competition is expected to increase by as much as 26%. Our procedure and findings shed light on how to analyze and optimize large-scale online field experiments in general. Accepted to KDD 2020
Published: 2020

19. Emoji-Powered Representation Learning for Cross-Lingual Sentiment Classification (Extended Abstract)

Author: Xuan Lu, Zhenpeng Chen, Qiaozhu Mei, Xuanzhe Liu, Ziniu Hu, and Sheng Shen
Subjects: Cross lingual, Computer science, Emoji, business.industry, Artificial intelligence, computer.software_genre, business, Feature learning, computer, Natural language processing
Abstract: Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.
Published: 2020
Full Text: View/download PDF

20. UMSIForeseer at SemEval-2020 Task 11: Propaganda Detection by Fine-Tuning BERT with Resampling and Ensemble Learning

Author: Qiaozhu Mei, Cristina Garbacea, and Yunzhe Jiang
Subjects: Propaganda techniques, Computer science, business.industry, Machine learning, computer.software_genre, Ensemble learning, SemEval, Task (project management), Categorization, Resampling, Artificial intelligence, Language model, business, computer
Abstract: We describe our participation at the SemEval 2020 “Detection of Propaganda Techniques in News Articles” - Techniques Classification (TC) task, designed to categorize textual fragments into one of the 14 given propaganda techniques. Our solution leverages pre-trained BERT models. We present our model implementations, evaluation results and analysis of these results. We also investigate the potential of combining language models with resampling and ensemble learning methods to deal with data imbalance and improve performance.
Published: 2020
Full Text: View/download PDF

21. Understanding Diverse Usage Patterns from Large-Scale Appstore-Service Profiles

Author: Feng Feng, Xuanzhe Liu, Hong Mei, Xuan Lu, Huoran Li, Tao Xie, and Qiaozhu Mei
Subjects: Mobile deep linking, business.industry, Computer science, 020206 networking & telecommunications, 020207 software engineering, 02 engineering and technology, App store, Electronic mail, World Wide Web, Empirical research, Software deployment, 0202 electrical engineering, electronic engineering, information engineering, Mobile telephony, Android (operating system), business, Mobile device, Software
Abstract: The prevalence of smart mobile devices has promoted the popularity of mobile applications (a.k.a. apps). Supporting mobility has become a promising trend in software engineering research. This article presents an empirical study of behavioral service profiles collected from millions of users whose devices are deployed with Wandoujia, a leading Android app-store service in China. The dataset of Wandoujia service profiles consists of two kinds of user behavioral data from using 0.28 million free Android apps, including (1) app management activities (i.e., downloading, updating, and uninstalling apps) from over 17 million unique users and (2) app network usage from over 6 million unique users. We explore multiple aspects of such behavioral data and present patterns of app usage. Based on the findings as well as derived knowledge, we also suggest some new open opportunities and challenges that can be explored by the research community, including app development, deployment, delivery, revenue, etc.
Published: 2018
Full Text: View/download PDF

22. Interactive medical word sense disambiguation through informed learning

Author: Qiaozhu Mei, Kai Zheng, Hua Xu, and Yue Wang
Subjects: 0301 basic medicine, Computer science, Active learning (machine learning), Health Informatics, Research and Applications, computer.software_genre, Vocabulary, Medical and Health Sciences, Medical Records, Interactive Learning, Domain (software engineering), Machine Learning, Medical Subject Headings, 03 medical and health sciences, Engineering, 0302 clinical medicine, Clinical Research, Information and Computing Sciences, 030212 general & internal medicine, Natural Language Processing, business.industry, Search engine indexing, Logistic Models, 030104 developmental biology, Learning curve, Test set, Metric (mathematics), Medicine, Domain knowledge, Artificial intelligence, business, computer, Algorithms, Medical Informatics, Natural language processing
Abstract: Objective: Medical word sense disambiguation (WSD) is challenging and often requires significant training with data labeled by domain experts. This work aims to develop an interactive learning algorithm that makes efficient use of expert’s domain knowledge in building high-quality medical WSD models with minimal human effort. Methods: We developed an interactive learning algorithm with expert labeling instances and features. An expert can provide supervision in 3 ways: labeling instances, specifying indicative words of a sense, and highlighting supporting evidence in a labeled instance. The algorithm learns from these labels and iteratively selects the most informative instances to ask for future labels. Our evaluation used 3 WSD corpora: 198 ambiguous terms from Medical Subject Headings (MSH) as MEDLINE indexing terms, 74 ambiguous abbreviations in clinical notes from the University of Minnesota (UMN), and 24 ambiguous abbreviations in clinical notes from Vanderbilt University Hospital (VUH). For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy on the test set against the number of labeled instances was generated. The area under the learning curve was used as the primary evaluation metric. Results: Our interactive learning algorithm significantly outperformed active learning, the previous fastest learning algorithm for medical WSD. Compared to active learning, it achieved 90% accuracy for the MSH corpus with 42% less labeling effort, 35% less labeling effort for the UMN corpus, and 16% less labeling effort for the VUH corpus. Conclusions: High-quality WSD models can be efficiently trained with minimal supervision by inviting experts to label informative instances and provide domain knowledge through labeling/highlighting contextual features.
Published: 2018
Full Text: View/download PDF

23. Deriving User Preferences of Mobile Apps from Their Management Activities

Author: Jian Tang, Qiaozhu Mei, Wei Ai, Gang Huang, Huoran Li, Xuanzhe Liu, and Feng Feng
Subjects: GeneralLiterature_INTRODUCTORYANDSURVEY, Computer science, Process (engineering), business.industry, media_common.quotation_subject, Behavioral pattern, 020207 software engineering, 02 engineering and technology, General Business, Management and Accounting, Usage data, Computer Science Applications, World Wide Web, Upload, User experience design, 020204 information systems, mental disorders, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), business, Mobile device, Host (network), Information Systems, media_common
Abstract: App marketplaces host millions of mobile apps that are downloaded billions of times. Investigating how people manage mobile apps in their everyday lives creates a unique opportunity to understand the behavior and preferences of mobile device users, infer the quality of apps, and improve user experience. Existing literature provides very limited knowledge about app management activities, due to the lack of app usage data at scale. This article takes the initiative to analyze a very large app management log collected through a leading Android app marketplace. The dataset covers 5 months of detailed downloading, updating, and uninstallation activities, which involve 17 million anonymized users and 1 million apps. We present a surprising finding that the metrics commonly used to rank apps in app stores do not truly reflect the users’ real attitudes. We then identify behavioral patterns from the app management activities that more accurately indicate user preferences of an app even when no explicit rating is available. A systematic statistical analysis is designed to evaluate machine learning models that are trained to predict user preferences using these behavioral patterns, which features an inverse probability weighting method to correct the selection biases in the training process.
Published: 2017
Full Text: View/download PDF

24. Development and empirical user-centered evaluation of semantically-based query recommendation for an electronic health record search engine

Author: Lei Yang, Kai Zheng, Katherine B. Murkowski-Steffy, Danny T. Y. Wu, V. G. Vinod Vydiswaran, Qiaozhu Mei, and David A. Hanauer
Subjects: 020205 medical informatics, Computer science, Information Storage and Retrieval, Health Informatics, 02 engineering and technology, Query language, Query optimization, Article, Ranking (information retrieval), 03 medical and health sciences, Query expansion, 0302 clinical medicine, Web query classification, 0202 electrical engineering, electronic engineering, information engineering, Electronic Health Records, Humans, 030212 general & internal medicine, Natural Language Processing, computer.programming_language, Information retrieval, Web search query, Concept search, Semantics, Computer Science Applications, Search Engine, computer, Algorithms, RDF query language
Abstract: A user-centered evaluation is conducted to assess the value of query recommendation. The feature is designed to facilitate retrieval of information from EHRs. The algorithm utilizes MetaMap to identify medical concepts. The performance is rated consistently higher with query recommendation turned on. Perceived usefulness and perceived ease of use scores are overwhelmingly positive. Objective: The utility of biomedical information retrieval environments can be severely limited when users lack expertise in constructing effective search queries. To address this issue, we developed a computer-based query recommendation algorithm that suggests semantically interchangeable terms based on an initial user-entered query. In this study, we assessed the value of this approach, which has broad applicability in biomedical information retrieval, by demonstrating its application as part of a search engine that facilitates retrieval of information from electronic health records (EHRs). Materials and Methods: The query recommendation algorithm utilizes MetaMap to identify medical concepts from search queries and indexed EHR documents. Synonym variants from UMLS are used to expand the concepts along with a synonym set curated from historical EHR search logs. The empirical study involved 33 clinicians and staff who evaluated the system through a set of simulated EHR search tasks. User acceptance was assessed using the widely used technology acceptance model. Results: The search engine's performance was rated consistently higher with the query recommendation feature turned on vs. off. The relevance of computer-recommended search terms was also rated high, and in most cases the participants had not thought of these terms on their own. The questions on perceived usefulness and perceived ease of use received overwhelmingly positive responses. A vast majority of the participants wanted the query recommendation feature to be available to assist in their day-to-day EHR search tasks. Discussion and Conclusion: Challenges persist for users to construct effective search queries when retrieving information from biomedical documents including those from EHRs. This study demonstrates that semantically-based query recommendation is a viable solution to addressing this challenge.
Published: 2017
Full Text: View/download PDF

25. Does team competition increase pro-social lending? Evidence from online microfinance

Author: Yang Liu, Qiaozhu Mei, Roy Chen, and Yan Chen
Subjects: Economics and Econometrics, Social psychology (sociology), Microfinance, business.industry, 05 social sciences, Control (management), Developing country, Public relations, law.invention, Competition (economics), Prosocial behavior, law, 0502 economics and business, Field research, 050207 economics, Marketing, business, Social identity theory, Finance, 050205 econometrics
Abstract: We investigate the effects of team competition on pro-social lending activity on Kiva.org, the first microlending website to match lenders with entrepreneurs in developing countries. Using naturally occurring field data, we find that lenders who join teams contribute 1.2 more loans ($30–$42) per month than those who do not. To further explore factors that differentiate successful teams from dormant ones, we run a large-scale randomized field experiment (n = 22,233) by posting forum messages. Compared to the control, we find that lenders make significantly more loans when exposed to a goal-setting and coordination message, whereas goal-setting alone significantly increases lending activities of previously inactive teams. Our findings suggest that goal-setting and coordination are effective mechanisms to increase pro-social behavior in teams.
Published: 2017
Full Text: View/download PDF

26. Cost-aware active learning for named entity recognition in clinical text

Author: Joshua C. Denny, Trevor Cohen, Hua Xu, Qiang Wei, Qiaozhu Mei, Qingxia Chen, Amy Franklin, Yukun Chen, Thomas A. Lasko, Stephen Wu, and Mandana Salimi
Subjects: Big Data, Active learning (machine learning), Computer science, Information Storage and Retrieval, Health Informatics, Sample (statistics), 02 engineering and technology, computer.software_genre, Machine learning, Research and Applications, Task (project management), 03 medical and health sciences, Annotation, 0302 clinical medicine, Named-entity recognition, 0202 electrical engineering, electronic engineering, information engineering, Electronic Health Records, Humans, Computer Simulation, 030212 general & internal medicine, Natural Language Processing, business.industry, Data set, Models, Economic, Learning curve, Passive learning, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Algorithms
Abstract: Objective: Active Learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost for annotating each sample is identical. This study introduces a cost-aware AL method, which simultaneously models both the annotation cost and the informativeness of the samples and evaluates both via simulation and user studies. Materials and Methods: We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities; we first utilized lexical and syntactic features to estimate annotation cost, then we incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we then conducted a simulation study comparing Cost-CAUSE with noncost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning. Results: Our cost model fit empirical annotation data well, and Cost-CAUSE increased the simulation area under the learning curve (ALC) scores by up to 5.6% and 4.9%, compared with random sampling and alternate AL methods. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%–30.2%. Discussion: Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors have a noticeable effect on the AL method, such as the annotation accuracy of users, the tiredness of users, and even the physical and mental condition of users. Conclusion: Cost-CAUSE saves significant annotation cost compared to random sampling.
Published: 2019

27. SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software Engineering

Author: Zhenpeng Chen, Xuanzhe Liu, Yanbin Cao, Qiaozhu Mei, and Xuan Lu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Emoji, Computer science, business.industry, Sentiment analysis, 020207 software engineering, 02 engineering and technology, Commit, Usage data, Software Engineering (cs.SE), Jargon, Computer Science - Software Engineering, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Labeled data, Software engineering, business, Classifier (UML), Feature learning, Computation and Language (cs.CL)
Abstract: Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers' emotions in commit messages and identifying their opinions on Q&A forums. However, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results on SE tasks and the misunderstanding of technical jargon is demonstrated to be the main reason. Then, researchers have to utilize labeled SE-related texts to customize sentiment analysis for SE tasks via a variety of algorithms. However, the scarce labeled data can cover only very limited expressions and thus cannot guarantee the analysis quality. To address such a problem, we turn to the easily available emoji usage data for help. More specifically, we employ emotional emojis as noisy labels of sentiments and propose a representation learning approach that uses both Tweets and GitHub posts containing emojis to learn sentiment-aware representations for SE-related texts. These emoji-labeled posts can not only supply the technical jargon, but also incorporate more general sentiment patterns shared across domains. They as well as labeled data are used to learn the final sentiment classifier. Compared to the existing sentiment analysis methods used in SE, the proposed approach can achieve significant improvement on representative benchmark datasets. By further contrast experiments, we find that the Tweets make a key contribution to the power of our approach. This finding informs future research not to unilaterally pursue the domain-specific resource, but try to transform knowledge from the open domain through ubiquitous signals such as emojis. Accepted by the 2019 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). Please include ESEC/FSE in any citations.
Published: 2019

28. Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification

Author: Zach Landis-Lewis, V. G. Vinod Vydiswaran, Karandeep Singh, Qiaozhu Mei, David A. Hanauer, and Chunhua Weng
Subjects: 020205 medical informatics, Computer science, Information Storage and Retrieval, Health Informatics, 02 engineering and technology, lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Inverted index, Health informatics, 03 medical and health sciences, 0302 clinical medicine, 0202 electrical engineering, electronic engineering, information engineering, Roman numerals, Information retrieval, Electronic Health Records, 030212 general & internal medicine, business.industry, Health Policy, Research, Natural language processing, Clinical Coding, Numbering, Computer Science Applications, Information extraction, Variable (computer science), Identification (information), Variation (linguistics), Lexical variation, lcsh:R858-859.7, Artificial intelligence, business, computer
Abstract: Background: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. Methods: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. Results: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. Conclusions: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.
Published: 2019

29. Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Author: Qiaozhu Mei, Cristina Garbacea, Shiyan Yan, and Samuel Carton
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, media_common.quotation_subject, Lexical diversity, Machine Learning (stat.ML), Context (language use), 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Machine Learning (cs.LG), Discriminative model, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), 0105 earth and related environmental sciences, media_common, Computer Science - Computation and Language, business.industry, Natural language generation, Ranking, 020201 artificial intelligence & image processing, Language model, Metric (unit), Artificial intelligence, business, Computation and Language (cs.CL), computer, Natural language processing
Abstract: We conduct a large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well machine-generated text can be distinguished from human-written text, as well as word overlap metrics that assess how similar the generated text is to human-written references. We determine to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, leaving a bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is challenging even for human evaluators, and human decisions correlate better with lexical overlaps. We find lexical diversity an intriguing metric that is indicative of the assessments of different evaluators. A post-experiment survey of participants provides insights into how to evaluate and improve the quality of natural language generation systems.
Published: 2019
Full Text: View/download PDF

30. Recommending teams promotes prosocial lending in online microfinance

Author: Wei Ai, Qiaozhu Mei, Roy Chen, Webb Phillips, and Yan Chen
Subjects: Microfinance, Multidisciplinary, Group membership, business.industry, 05 social sciences, Social Sciences, Joins, Public relations, Recommender system, law.invention, Test (assessment), Intervention (law), Prosocial behavior, law, 0502 economics and business, 050207 economics, Marketing, Social identity theory, business, 050205 econometrics
Abstract: Significance: With three billion people subsisting on the equivalent of $2.50 per day, alleviating poverty is one of the most urgent challenges facing the world today. One solution to this problem has been to encourage the growth of small enterprises through microlending. A successful innovation is represented by Kiva.org, which matches citizen lenders with low-income entrepreneurs in developing countries. To increase prosocial lending, we use a large-scale field experiment and machine-learning methods to recommend lending teams to lenders. We find that lenders who join a team contribute significantly more compared with those who do not. Our results suggest team recommendation can be an effective and low-cost behavioral mechanism to increase charitable contributions.
Published: 2016
Full Text: View/download PDF

31. Extractive Adversarial Networks: High-Recall Explanations for Identifying Personal Attacks in Social Media Posts

Author: Qiaozhu Mei, Paul Resnick, and Samuel Carton
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Recall, Computer science, Machine Learning (stat.ML), 02 engineering and technology, Data science, Term (time), Domain (software engineering), Machine Learning (cs.LG), Computer Science - Information Retrieval, Adversarial system, Statistics - Machine Learning, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Social media, Set (psychology), Classifier (UML), Computation and Language (cs.CL), Information Retrieval (cs.IR)
Abstract: We introduce an adversarial method for producing high-recall explanations of neural text classifier decisions. Building on an existing architecture for extractive explanations via hard attention, we add an adversarial layer which scans the residual of the attention for remaining predictive signal. Motivated by the important domain of detecting personal attacks in social media comments, we additionally demonstrate the importance of manually setting a semantically appropriate 'default' behavior for the model by explicitly manipulating its bias term. We develop a validation set of human-annotated personal attacks to evaluate the impact of these changes. Accepted to EMNLP 2018. Code and data available at https://github.com/shcarton/rcnn
Published: 2018

32. Identify Shifts of Word Semantics through Bayesian Surprise

Author: Fei Wu, Qiaozhu Mei, Cheng Li, Zhe Zhao, and Zhuofeng Wu
Subjects: Structure (mathematical logic), Span (category theory), business.industry, Computer science, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 02 engineering and technology, Semantics, computer.software_genre, Margin (machine learning), Dynamics (music), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Period (music), Word (computer architecture)
Abstract: Much work has been done recently on learning word embeddings from large corpora, which attempts to find the coordinates of words in a static and high dimensional semantic space. In reality, such corpora often span a sufficiently long time period, during which the meanings of many words may have changed. The co-evolution of word meanings may also result in a distortion of the semantic space, making these static embeddings unable to accurately represent the dynamics of semantics. In this paper, we present a novel computational method to capture such changes and to model the evolution of word semantics. Distinct from existing approaches that learn word embeddings independently from time periods and then align them, our method explicitly establishes the stable topological structure of word semantics and identifies the surprising changes in the semantic space over time through a principled statistical method. Empirical experiments on large-scale real-world corpora demonstrate the effectiveness of the proposed approach, which outperforms the state-of-the-art by a large margin.
Published: 2018
Full Text: View/download PDF

33. Joint Modeling of Text and Networks for Cascade Prediction

Author: Cheng Li, Xiaoxiao Guo, and Qiaozhu Mei
Abstract: A critical research problem about information cascades, which is a central topic of social network analysis, is to predict the potential influence or the future growth of cascades. Recent developments of deep learning have provided promising alternatives, which no longer rely on heavy feature engineering efforts and instead learn the representation of cascade graphs in an end-to-end manner. In reality, however, the influence of a cascade not only depends on the cascade graph and the global network structure, but also largely relies on the content of the cascade and the preferences of users. In this work, we extend the deep learning approaches to cascade prediction by jointly modeling the content and the structure of cascades. We find that text information provides a valuable addition for the learning of cascade graphs, especially when some users (nodes) have rarely participated in the past cascades. To this end, a gating mechanism is introduced to dynamically fuse the structural and textual representations of nodes based on their respective properties. Attentions are employed to incorporate the text information associated with both cascade items and nodes. Empirical experiments demonstrate that incorporating text information brings a significant improvement to cascade prediction, and that the proposed model outperforms alternative ways to combine text and networks.
Published: 2018
Full Text: View/download PDF

34. A study of active learning methods for named entity recognition in clinical text

Author: Joshua C. Denny, Thomas A. Lasko, Yukun Chen, Hua Xu, and Qiaozhu Mei
Subjects: Active learning, Active learning (machine learning), Computer science, Health Informatics, 02 engineering and technology, Semi-supervised learning, Machine learning, computer.software_genre, Article, Machine Learning, 03 medical and health sciences, symbols.namesake, Annotation, 0302 clinical medicine, Clinical natural language processing, Named-entity recognition, 0202 electrical engineering, electronic engineering, information engineering, Humans, Learning, 030212 general & internal medicine, Clinical named entity recognition, Natural Language Processing, business.industry, Sampling (statistics), Computer Science Applications, Learning curve, Passive learning, symbols, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Gibbs sampling
Abstract: We developed novel active learning algorithms for clinical named entity recognition. Equal cost per sample is not a practical annotation cost assumption in this task. We evaluated methods based on two types of estimated annotation cost. To achieve 0.8 in F-measure, active learning could save 42% annotation cost in words. The actual benefit of active learning should be further evaluated in real time. Objectives: Named entity recognition (NER), a sequential labeling task, is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance, but they often require large amounts of annotated samples, which are expensive to build due to the requirement of domain experts in annotation. Active learning (AL), a sample selection approach integrated with supervised ML, aims to minimize the annotation cost while maximizing the performance of ML-based models. In this study, our goal was to develop and evaluate both existing and new AL methods for a clinical NER task to identify concepts of medical problems, treatments, and lab tests from the clinical notes. Methods: Using the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences, we simulated AL experiments using a number of existing and novel algorithms in three different categories including uncertainty-based, diversity-based, and baseline sampling strategies. They were compared with the passive learning that uses random sampling. Learning curves that plot performance of the NER model against the estimated annotation cost (based on number of sentences or words in the training set) were generated to evaluate different active learning and the passive learning methods and the area under the learning curve (ALC) score was computed. Results: Based on the learning curves of F-measure vs. number of sentences, uncertainty sampling algorithms outperformed all other methods in ALC. Most diversity-based methods also performed better than random sampling in ALC. To achieve an F-measure of 0.80, the best method based on uncertainty sampling could save 66% annotations in sentences, as compared to random sampling. For the learning curves of F-measure vs. number of words, uncertainty sampling methods again outperformed all other methods in ALC. To achieve 0.80 in F-measure, in comparison to random sampling, the best uncertainty based method saved 42% annotations in words. But the best diversity based method reduced only 7% annotation effort. Conclusion: In the simulated setting, AL methods, particularly uncertainty-sampling based approaches, seemed to significantly save annotation cost for the clinical NER task. The actual benefit of active learning in clinical NER should be further evaluated in a real-time setting.
Published: 2015
Full Text: View/download PDF

35. Assessing the readability of ClinicalTrials.gov

Author: Lawrence C. An, V. G. Vinod Vydiswaran, Patricia M. Clark, Qiaozhu Mei, Qing T. Zeng, David A. Hanauer, Kai Zheng, Kevyn Collins-Thompson, Danny T. Y. Wu, and Joshua Proulx
Subjects: Vocabulary, Databases, Factual, 020205 medical informatics, Sentence length, Computer science, media_common.quotation_subject, Information Dissemination, Health Informatics, 02 engineering and technology, Research and Applications, computer.software_genre, 03 medical and health sciences, 0302 clinical medicine, MedlinePlus, Terminology as Topic, 0202 electrical engineering, electronic engineering, information engineering, 030212 general & internal medicine, internet.website, internet, media_common, Analysis of Variance, Clinical Trials as Topic, Consumer Health Information, business.industry, Subject (documents), Readability, Clinical trial, Comprehension, Artificial intelligence, business, computer, Algorithms, Natural language processing
Abstract: Objective: ClinicalTrials.gov serves critical functions of disseminating trial information to the public and helping the trials recruit participants. This study assessed the readability of trial descriptions at ClinicalTrials.gov using multiple quantitative measures. Materials and Methods: The analysis included all 165 988 trials registered at ClinicalTrials.gov as of April 30, 2014. To obtain benchmarks, the authors also analyzed 2 other medical corpora: (1) all 955 Health Topics articles from MedlinePlus and (2) a random sample of 100 000 clinician notes retrieved from an electronic health records system intended for conveying internal communication among medical professionals. The authors characterized each of the corpora using 4 surface metrics, and then applied 5 different scoring algorithms to assess their readability. The authors hypothesized that clinician notes would be most difficult to read, followed by trial descriptions and MedlinePlus Health Topics articles. Results: Trial descriptions have the longest average sentence length (26.1 words) across all corpora; 65% of their words used are not covered by a basic medical English dictionary. In comparison, average sentence length of MedlinePlus Health Topics articles is 61% shorter, vocabulary size is 95% smaller, and dictionary coverage is 46% higher. All 5 scoring algorithms consistently rated ClinicalTrials.gov trial descriptions the most difficult corpus to read, even harder than clinician notes. On average, it requires 18 years of education to properly understand these trial descriptions according to the results generated by the readability assessment algorithms. Discussion and Conclusion: Trial descriptions at ClinicalTrials.gov are extremely difficult to read. Significant work is warranted to improve their readability in order to achieve ClinicalTrials.gov’s goal of facilitating information dissemination and subject recruitment.
Published: 2015
Full Text: View/download PDF

36. Extracting relations from traditional Chinese medicine literature via heterogeneous entity networks

Author: Huaiyu Wan, Jie Tang, Qiaozhu Mei, Walter Luyten, Marie-Francine Moens, Lu Liu, and Xuezhong Zhou
Subjects: 0301 basic medicine, Support Vector Machine, Relation (database), Exploit, Computer science, Datasets as Topic, Information Storage and Retrieval, Inference, Health Informatics, Research and Applications, computer.software_genre, Relationship extraction, Support vector machine, 03 medical and health sciences, 030104 developmental biology, Classifier (linguistics), Humans, Enhanced Data Rates for GSM Evolution, Data mining, Medicine, Chinese Traditional, computer, Factor graph
Abstract: OBJECTIVE: Traditional Chinese medicine (TCM) is a unique and complex medical system that has developed over thousands of years. This article studies the problem of automatically extracting meaningful relations of entities from TCM literature, for the purposes of assisting clinical treatment or poly-pharmacology research and promoting the understanding of TCM in Western countries. METHODS: Instead of separately extracting each relation from a single sentence or document, we propose to collectively and globally extract multiple types of relations (eg, herb-syndrome, herb-disease, formula-syndrome, formula-disease, and syndrome-disease relations) from the entire corpus of TCM literature, from the perspective of network mining. In our analysis, we first constructed heterogeneous entity networks from the TCM literature, in which each edge is a candidate relation, then used a heterogeneous factor graph model (HFGM) to simultaneously infer the existence of all the edges. We also employed a semi-supervised learning algorithm to estimate the model's parameters. RESULTS: We applied our method to extract relations from a large dataset consisting of more than 100,000 TCM article abstracts. Our results show that the performance of the HFGM at extracting all types of relations from TCM literature was significantly better than a traditional support vector machine (SVM) classifier (increasing the average precision by 11.09%, the recall by 13.83%, and the F1-measure by 12.47% for different types of relations, compared with a traditional SVM classifier). CONCLUSION: This study exploits the power of collective inference and proposes an HFGM based on heterogeneous entity networks, which significantly improved our ability to extract relations from TCM literature. Published in: Journal of the American Medical Informatics Association, vol. 23, issue 2, pages 356-365.
Published: 2015
Full Text: View/download PDF

37. DeepCas

Author: Cheng Li, Qiaozhu Mei, Jiaqi Ma, and Xiaoxiao Guo
Subjects: Social and Information Networks (cs.SI), FOS: Computer and information sciences, Graph kernel, Theoretical computer science, Social network, business.industry, Computer science, Node (networking), Deep learning, Computer Science - Social and Information Networks, 02 engineering and technology, Machine learning, computer.software_genre, Machine Learning (cs.LG), Computer Science - Learning, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Artificial intelligence, Information cascade, Heuristics, Representation (mathematics), business, computer
Abstract: Information cascades, effectively facilitated by most social network platforms, are recognized as a major factor in almost every social success and disaster in these networks. Can cascades be predicted? While many believe that they are inherently unpredictable, recent work has shown that some key properties of information cascades, such as size, growth, and shape, can be predicted by a machine learning algorithm that combines many features. These predictors all depend on a bag of hand-crafted features to represent the cascade network and the global network structure. Such features, always carefully and sometimes mysteriously designed, are not easy to extend or to generalize to a different platform or domain. Inspired by the recent successes of deep learning in multiple data mining tasks, we investigate whether an end-to-end deep learning approach could effectively predict the future size of cascades. Such a method automatically learns the representation of individual cascade graphs in the context of the global network structure, without hand-crafted features and heuristics. We find that node embeddings fall short of predictive power, and it is critical to learn the representation of a cascade graph as a whole. We present algorithms that learn the representation of cascade graphs in an end-to-end manner, which significantly improve the performance of cascade prediction over strong baselines that include feature based methods, node embedding methods, and graph kernel methods. Our results also provide interesting implications for cascade prediction in general.
Published: 2017
Full Text: View/download PDF

38. Deep Memory Networks for Attitude Identification

Author: Cheng Li, Qiaozhu Mei, and Xiaoxiao Guo
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Polarity (physics), Computer science, business.industry, Deep learning, 02 engineering and technology, Machine learning, computer.software_genre, Semantics, Task (project management), Identification (information), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Set (psychology), business, Computation and Language (cs.CL), computer
Abstract: We consider the task of identifying attitudes towards a given set of entities from text. Conventionally, this task is decomposed into two separate subtasks: target detection that identifies whether each entity is mentioned in the text, either explicitly or implicitly, and polarity classification that classifies the exact sentiment towards an identified entity (the target) into positive, negative, or neutral. Instead, we show that attitude identification can be solved with an end-to-end machine learning architecture, in which the two subtasks are interleaved by a deep memory network. In this way, signals produced in target detection provide clues for polarity classification, and reversely, the predicted polarity provides feedback to the identification of targets. Moreover, the treatments for the set of targets also influence each other -- the learned representations may share the same semantics for some targets but vary for others. The proposed deep memory network, the AttNet, outperforms methods that do not consider the interactions between the subtasks or those among the targets, including conventional machine learning methods and the state-of-the-art deep learning models. Accepted to WSDM'17
Published: 2017

39. Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android Users

Author: Xuanzhe Liu, Zhenpeng Chen, Qiaozhu Mei, Wei Ai, Xuan Lu, and Huoran Li
Subjects: User information, FOS: Computer and information sciences, Information retrieval, Computer science, Emoji, Computer Science - Human-Computer Interaction, 0202 electrical engineering, electronic engineering, information engineering, Inference, 020207 software engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, Android (operating system), Human-Computer Interaction (cs.HC)
Abstract: Based on a large data set of emoji usage behavior collected from smartphone users over the world, this paper investigates gender-specific usage of emojis. We present various interesting findings that evidence a considerable difference in emoji usage by female and male users. Such a difference is significant not just in a statistical sense; it is sufficient for a machine learning algorithm to accurately infer the gender of a user purely based on the emojis used in their messages. In real world scenarios where gender inference is a necessity, models based on emojis have unique advantages over existing models that are based on textual or contextual information. Emojis not only provide language-independent indicators, but also alleviate the risk of leaking private user information through the analysis of text and metadata. Comment: The Web Conference 2018 (WWW 2018)
Published: 2017
Full Text: View/download PDF

40. Applying active learning to supervised word sense disambiguation in MEDLINE

Author: Hongxin Cao, Kai Zheng, Qiaozhu Mei, Yukun Chen, and Hua Xu
Subjects: Support Vector Machine, business.industry, Computer science, MEDLINE, Health Informatics, Problem-Based Learning, Semi-supervised learning, Research and Applications, Machine learning, computer.software_genre, Support vector machine, Annotation, Problem-based learning, Artificial Intelligence, Learning curve, Test set, Metric (mathematics), Active learning, Artificial intelligence, business, computer, Algorithms, Natural language processing
Abstract: Objectives This study aimed to assess whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods, thus reducing the number of annotated samples, while keeping or improving the quality of disambiguation models. Methods We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and were compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy computed from the test set as a function of the number of annotated samples used in the model was generated. The area under the learning curve (ALC) was used as the primary metric for evaluation. Results Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases, which provide useful insight for future improvements. Conclusions This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models.
Published: 2013
Full Text: View/download PDF

41. Visualizing Large-scale and High-dimensional Data

Author: Ming Zhang, Qiaozhu Mei, Jingzhou Liu, and Jian Tang
Subjects: FOS: Computer and information sciences, 0301 basic medicine, Clustering high-dimensional data, Computer science, Computer Science - Human-Computer Interaction, Statistical model, Scale (descriptive set theory), 02 engineering and technology, Graph, Machine Learning (cs.LG), Human-Computer Interaction (cs.HC), Data set, 03 medical and health sciences, Computer Science - Learning, 030104 developmental biology, Data point, Stochastic gradient descent, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Time complexity, Algorithm
Abstract: We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing the state-of-the-art methods such as the t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data points and hundreds of dimensions). We propose the LargeVis, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then lays out the graph in the low-dimensional space. Compared to t-SNE, LargeVis significantly reduces the computational cost of the graph construction step and employs a principled probabilistic model for the visualization step, the objective of which can be effectively optimized through asynchronous stochastic gradient descent with a linear time complexity. The whole procedure thus easily scales to millions of high-dimensional data points. Experimental results on real-world data sets demonstrate that the LargeVis outperforms the state-of-the-art methods in both efficiency and effectiveness. The hyper-parameters of LargeVis are also much more stable over different data sets. WWW 2016
Published: 2016

42. BeeSpace Navigator: exploratory analysis of gene function using semantic indexing of biological literature

Author: Xin He, Bruce R. Schatz, ChengXiang Zhai, Radhika S. Khetani, Qiaozhu Mei, Moushumi Sen Sarma, Brant W. Chee, Jing Jiang, Xu Ling, and David Arcoleo
Subjects: Abstracting and Indexing, Interface (Java), MEDLINE, media_common.quotation_subject, Biology, Semantics, Login, computer.software_genre, Bioinformatics, Task (project management), 03 medical and health sciences, Software, Genetics, Animals, Function (engineering), 030304 developmental biology, media_common, Internet, 0303 health sciences, business.industry, 05 social sciences, Search engine indexing, Articles, Expression (mathematics), Genes, Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, business, computer, Natural language processing
Abstract: With the rapid decrease in cost of genome sequencing, the classification of gene function is becoming a primary problem. Such classification has been performed by human curators who read biological literature to extract evidence. BeeSpace Navigator is a prototype software for exploratory analysis of gene function using biological literature. The software supports an automatic analogue of the curator process to extract functions, with a simple interface intended for all biologists. Since extraction is done on selected collections that are semantically indexed into conceptual spaces, the curation can be task specific. Biological literature containing references to gene lists from expression experiments can be analyzed to extract concepts that are computational equivalents of a classification such as Gene Ontology, yielding discriminating concepts that differentiate gene mentions from other mentions. The functions of individual genes can be summarized from sentences in biological literature, to produce results resembling a model organism database entry that is automatically computed. Statistical frequency analysis based on literature phrase extraction generates offline semantic indexes to support these gene function services. The website with BeeSpace Navigator is free and open to all; there is no login requirement at www.beespace.illinois.edu for version 4. Materials from the 2010 BeeSpace Software Training Workshop are available at www.beespace.illinois.edu/bstwmaterials.php.
Published: 2011
Full Text: View/download PDF

43. Characterizing Smartphone Usage Patterns from Millions of Android Users

Author: Kaigui Bian, Xuanzhe Liu, Tao Xie, Huoran Li, Feng Feng, Qiaozhu Mei, Xuan Lu, and Felix Xiaozhu Lin
Subjects: World Wide Web, Engineering, ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Mobile deep linking, GeneralLiterature_INTRODUCTORYANDSURVEY, business.industry, mental disorders, Internet privacy, Mobile apps, Android (operating system), business, GeneralLiterature_MISCELLANEOUS
Abstract: The prevalence of smart devices has promoted the popularity of mobile applications (a.k.a. apps) in recent years. A number of interesting and important questions remain unanswered, such as why a user likes/dislikes an app, how an app becomes popular or eventually perishes, how a user selects apps to install and interacts with them, how frequently an app is used and how much traffic it generates, etc. This paper presents an empirical analysis of app usage behaviors collected from millions of users of Wandoujia, a leading Android app marketplace in China. The dataset covers two types of user behaviors of using over 0.2 million Android apps, including (1) app management activities (i.e., installation, updating, and uninstallation) of over 0.8 million unique users and (2) app network traffic from over 2 million unique users. We explore multiple aspects of such behavior data and present interesting patterns of app usage. The results provide many useful implications to the developers, users, and disseminators of mobile apps.
Published: 2015
Full Text: View/download PDF

44. Click-through Prediction for Advertising in Twitter Timeline

Author: Cheng Li, Qiaozhu Mei, Sandeep Pandey, Dong Wang, and Yue Lu
Subjects: Data stream, World Wide Web, Search engine, business.industry, Computer science, Timeline, Static web page, Advertising, Context (language use), Contextual advertising, Click-through rate, business, Online advertising
Abstract: We present the problem of click-through prediction for advertising in Twitter timeline, which displays a stream of Tweets from accounts a user chooses to follow. Traditional computational advertising usually appears in two forms: sponsored search that places ads onto the search result page when a query is issued to a search engine, and contextual advertising that places ads onto a regular, usually static Web page. Compared with these two paradigms, placing ads into a Tweet stream is particularly challenging given the nature of the data stream: the context into which an ad can be placed updates dynamically and never replicates. Every ad is therefore placed into a unique context. This makes the information available for training a machine learning model extremely sparse. In this study, we propose a learning-to-rank method which not only addresses the sparsity of training signals but also can be trained and updated online. The proposed method is evaluated using both offline experiments and online A/B tests, which involve very large collections of Twitter data and real Twitter users. Results of the experiments prove the effectiveness and efficiency of our solution, and its superiority over the current production model adopted by Twitter.
Published: 2015
Full Text: View/download PDF

45. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Author: Qiaozhu Mei, Jian Tang, and Meng Qu
Subjects: FOS: Computer and information sciences, Computer science, 02 engineering and technology, Machine learning, computer.software_genre, Convolutional neural network, Machine Learning (cs.LG), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Neural and Evolutionary Computing (cs.NE), Representation (mathematics), Computer Science - Computation and Language, business.industry, I.2.6, Deep learning, Computer Science - Neural and Evolutionary Computing, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Computer Science - Learning, Embedding, 020201 artificial intelligence & image processing, Artificial intelligence, business, Computation and Language (cs.CL), computer, Feature learning, Word (computer architecture), Predictive text
Abstract: Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low dimensional representations learned are applicable to many different tasks, they are not particularly tuned for any task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call the predictive text embedding (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low dimensional space through a principled and efficient algorithm. This low dimensional embedding not only preserves the semantic closeness of words and documents, but also has a strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparable or more effective, much more efficient, and has fewer parameters to tune. KDD 2015
Published: 2015

46. Mining consumer health vocabulary from community-generated text

Author: V G Vinod Vydiswaran, Qiaozhu Mei, David A Hanauer, and Kai Zheng
Subjects: Encyclopedias as Topic, Internet, Consumer Health Information, Terminology as Topic, Data Mining, Articles, Vocabulary
Abstract: Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text.
Published: 2015

47. LINE: Large-scale Information Network Embedding

Author: Ming Zhang, Mingzhe Wang, Jun Yan, Jian Tang, Meng Qu, and Qiaozhu Mei
Subjects: FOS: Computer and information sciences, Theoretical computer science, Social network, Computer science, Graph embedding, business.industry, Node (networking), Visualization, Machine Learning (cs.LG), Computer Science - Learning, Stochastic gradient descent, Global network, Line (geometry), Embedding, business
Abstract: This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale for real world information networks which usually contain millions of nodes. In this paper, we propose a novel network embedding method called the "LINE," which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments prove the effectiveness of the LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient, which is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of the LINE is available online. Comment: WWW 2015
Published: 2015
Full Text: View/download PDF

48. Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis

Author: Naren Ramakrishnan, David A. Hanauer, Mohammed Saeed, Qiaozhu Mei, Kerby Shedden, Alan R. Aronson, and Kai Zheng
Subjects: Data source, Information retrieval, business.industry, MEDLINE, Unified Medical Language System, Health Informatics, Research and Applications, International Classification of Diseases, Medicine, Data Mining, Feasibility Studies, Humans, Relevance (information retrieval), Pairwise comparison, Medical diagnosis, business, Natural Language Processing
Abstract: Objective We describe experiments designed to determine the feasibility of distinguishing known from novel associations based on a clinical dataset comprised of International Classification of Disease, V.9 (ICD-9) codes from 1.6 million patients by comparing them to associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations appearing only in the clinical dataset, but not in Medline citations, are potentially novel. Methods Pairwise associations of ICD-9 codes were independently identified in both the clinical and Medline datasets, which were then compared to quantify their degree of overlap. We also performed a manual review of a subset of the associations to validate how well MetaMap performed in identifying diagnoses mentioned in Medline citations that formed the basis of the Medline associations. Results The overlap of associations based on ICD-9 codes in the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that cooccurring diagnoses from Medline citations do not always represent clinically meaningful associations. Discussion Identifying novel associations derived from large clinical datasets remains challenging. Medline as a sole data source for existing knowledge may not be adequate to filter out widely known associations. Conclusions In this study, novel associations were not readily identified. Further improvements in accuracy and relevance for tools such as MetaMap are needed to realize their expected utility.
Published: 2014

49. Applying multiple methods to assess the readability of a large corpus of medical documents

Author: Danny T Y Wu, David A Hanauer, Qiaozhu Mei, Patricia M Clark, Lawrence C An, Jianbo Lei, Joshua Proulx, Qing Zeng-Treitler, and Kai Zheng
Subjects: Reading, Vocabulary, Controlled, Artificial Intelligence, MEDLINE, Humans, Documentation, Comprehension, MedlinePlus, Article, Natural Language Processing
Abstract: Medical documents provided to patients at the end of an episode of care, such as discharge summaries and referral letters, serve as an important vehicle to convey critical information to patients and families. Increasingly, healthcare institutions are also experimenting with granting patients direct electronic access to other types of clinical narratives that are not typically shared unless explicitly requested, such as progress notes. While these efforts have great potential to improve information transparency, their value can be severely diminished if patients are unable to read and thus unable to properly interpret the medical documents shared with them. In this study, we approached the problem by contrasting the 'readability' of two types of medical documents: referral letters vs. other genres of narrative clinician notes not explicitly intended for direct viewing by patients. To establish a baseline for comparison, we also computed readability scores of MedlinePlus articles, exemplars of fine patient education materials carefully crafted for lay audiences. We quantified document readability using four different measures. Differences in the results obtained through these measures are also discussed.
Published: 2013

50. Hedging their mets: the use of uncertainty terms in clinical documents and its potential implications when sharing the documents with patients

Author: David A Hanauer, Yang Liu, Qiaozhu Mei, Frank J Manion, Ulysses J Balis, and Kai Zheng
Subjects: Patient Access to Records, Medical Records Systems, Computerized, Physicians, Electronic Health Records, Humans, Articles, humanities, Language, Natural Language Processing
Abstract: In this study, we quantified the use of uncertainty expressions, referred to as ‘hedge’ phrases, among a corpus of 100,000 clinical documents retrieved from our institution’s electronic health record system. The frequency of each hedge phrase appearing in the corpus was characterized across document types and clinical departments. We also used a natural language processing tool to identify clinical concepts that were spatially, and potentially semantically, associated with the hedge phrases identified. The objective was to delineate the prevalence of hedge phrase usage in clinical documentation which may have a profound impact on patient care and provider–patient communication, and may become a source of unintended consequences when such documents are made directly accessible to patients via patient portals.
Published: 2013
