Author: "Wu, Tongshuang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wu, Tongshuang"' showing total 172 results

Start Over Author "Wu, Tongshuang"

172 results on '"Wu, Tongshuang"'

1. Orbit: A Framework for Designing and Evaluating Multi-objective Rankers

Author: Yang, Chenyang, Xiao, Tesi, Shavlovsky, Michael, Kästner, Christian, and Wu, Tongshuang
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Information Retrieval
Abstract: Machine learning in production needs to balance multiple objectives: This is particularly evident in ranking or recommendation models, where conflicting objectives such as user engagement, satisfaction, diversity, and novelty must be considered at the same time. However, designing multi-objective rankers is inherently a dynamic wicked problem -- there is no single optimal solution, and the needs evolve over time. Effective design requires collaboration between cross-functional teams and careful analysis of a wide range of information. In this work, we introduce Orbit, a conceptual framework for Objective-centric Ranker Building and Iteration. The framework places objectives at the center of the design process, to serve as boundary objects for communication and guide practitioners for design and evaluation. We implement Orbit as an interactive system, which enables stakeholders to interact with objective spaces directly and supports real-time exploration and evaluation of design trade-offs. We evaluate Orbit through a user study involving twelve industry practitioners, showing that it supports efficient design space exploration, leads to more informed decision-making, and enhances awareness of the inherent trade-offs of multiple objectives. Orbit (1) opens up new opportunities of an objective-centric design process for any multi-objective ML models, as well as (2) sheds light on future designs that push practitioners to go beyond a narrow metric-centric or example-centric mindset.
Published: 2024

2. HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Author: Wang, Zirui, Zhao, Xinran, Stepputtis, Simon, Kim, Woojun, Wu, Tongshuang, Sycara, Katia, and Xie, Yaqi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multiagent Systems
Abstract: Understanding and predicting human actions has been a long-standing challenge and is a crucial measure of perception in robotics AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real-world human activity -- interactions. To address this gap in human-like forecasting within multi-agent environments, we present the Hierarchical Memory-Aware Transformer (HiMemFormer), a transformer-based model for online multi-agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent-specific features based on these global representations using a coarse-to-fine strategy. In contrast to previous approaches, HiMemFormer uniquely hierarchically applies the global context with agent-specific preferences to avoid noisy or redundant information in multi-agent action anticipation. Extensive experiments on various multi-agent scenarios demonstrate the significant performance of HiMemFormer, compared with other state-of-the-art methods.
Published: 2024

3. What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing

Author: Yang, Chenyang, Hong, Yining, Lewis, Grace A., Wu, Tongshuang, and Kästner, Christian
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
Published: 2024

4. What You Say = What You Want? Teaching Humans to Articulate Requirements for LLMs

Author: Ma, Qianou, Peng, Weirui, Shen, Hua, Koedinger, Kenneth, and Wu, Tongshuang
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: Prompting ChatGPT to achieve complex goals (e.g., creating a customer support chatbot) often demands meticulous prompt engineering, including aspects like fluent writing and chain-of-thought techniques. While emerging prompt optimizers can automatically refine many of these aspects, we argue that clearly conveying customized requirements (e.g., how to handle diverse inputs) remains a human-centric challenge. In this work, we introduce Requirement-Oriented Prompt Engineering (ROPE), a paradigm that focuses human attention on generating clear, complete requirements during prompting. We implement ROPE through an assessment and training suite that provides deliberate practice with LLM-generated feedback. In a study with 30 novices, we show that requirement-focused training doubles novices' prompting performance, significantly outperforming conventional prompt engineering training and prompt optimization. We also demonstrate that high-quality LLM outputs are directly tied to the quality of input requirements. Our work paves the way for more effective task delegation in human-LLM collaborative prompting., Comment: 15 pages, 5 figures
Published: 2024

5. SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

Author: Zhao, Chenyang, Jia, Xueying, Viswanathan, Vijay, Wu, Tongshuang, and Neubig, Graham
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. However, prompting often leads models to make predictions with lower accuracy compared to finetuning a model with ample training data. On the other hand, while finetuning LLMs on task-specific data generally improves their performance, abundant annotated datasets are not available for all tasks. Previous work has explored generating task-specific data from state-of-the-art LLMs and using this data to finetune smaller models, but this approach requires access to a language model other than the one being trained, which introduces cost, scalability challenges, and legal hurdles associated with continuously relying on more powerful LLMs. In response to these, we propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM, then use these input-output pairs to finetune the student LLM itself. In our empirical evaluation of the Natural Instructions V2 benchmark, we find that SELF-GUIDE improves the performance of LLM by a substantial margin. Specifically, we report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics. This sheds light on the promise of self-synthesized data guiding LLMs towards becoming task-specific experts without any external learning signals., Comment: Accepted by COLM 2024
Published: 2024

6. Synthetic Multimodal Question Generation

Author: Wu, Ian, Jayanthi, Sravan, Viswanathan, Vijay, Rosenberg, Simon, Pakazad, Sina, Wu, Tongshuang, and Neubig, Graham
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. A key challenge with evaluating MMRAG is the paucity of high-quality datasets matching the question styles and modalities of interest. In light of this, we propose SMMQG, a synthetic data generation framework. SMMQG leverages interplay between a retriever, large language model (LLM) and large multimodal model (LMM) to generate question and answer pairs directly from multimodal documents, with the questions conforming to specified styles and modalities. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it, revealing insights into model performance that are attainable only through style- and modality-specific evaluation data. Next, we measure the quality of data produced by SMMQG via a human study. We find that the quality of SMMQG-generated synthetic data is on par with the quality of the crowdsourced benchmark MMQA and that downstream evaluation results using both datasets strongly concur., Comment: Accepted to EMNLP 2024 Findings; Camera Ready
Published: 2024

7. WebCanvas: Benchmarking Web Agents in Online Environments

Author: Pan, Yichen, Kong, Dehan, Zhou, Sida, Cui, Cheng, Leng, Yifei, Jiang, Bing, Liu, Hangyu, Shang, Yanyi, Zhou, Shuyan, Wu, Tongshuang, and Wu, Zhengyang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, 68T50, I.2.7
Abstract: For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research., Comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
Published: 2024

8. Beyond Relevance: Evaluate and Improve Retrievers on Perspective Awareness

Author: Zhao, Xinran, Chen, Tong, Chen, Sihao, Zhang, Hongming, and Wu, Tongshuang
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected to not only rely on the semantic relevance between the documents and the queries but also recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting vs. contradicting perspectives, for the downstream system to make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting vs. opposing documents? We reform and extend six existing tasks to create a benchmark for retrieval, where we have diverse perspectives described in free-form text, besides root, neutral queries. We show that current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by the observation, we further explore the potential to leverage geometric features of retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints on essay writing, compared to non-perspective-aware baselines.
Published: 2024

9. Better Synthetic Data by Retrieving and Transforming Existing Datasets

Author: Gandhi, Saumya, Gala, Ritu, Viswanathan, Vijay, Wu, Tongshuang, and Neubig, Graham
Subjects: Computer Science - Computation and Language
Abstract: Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model., Comment: PDF fixed in v3
Published: 2024

10. Evaluating Mathematical Reasoning Beyond Accuracy

Author: Xia, Shijie, Li, Xuefeng, Liu, Yixin, Wu, Tongshuang, and Liu, Pengfei
Subjects: Computer Science - Computation and Language
Abstract: The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs $\textit{validity}$ and $\textit{redundancy}$ to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. Instantiated by base models that possess strong mathematical knowledge and trained with high-quality labeled data, ReasonEval achieves state-of-the-art performance on human-labeled datasets and can accurately detect different types of errors generated by perturbation. When applied to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We release the best-performing model, meta-evaluation script, and all evaluation results at https://github.com/GAIR-NLP/ReasonEval.
Published: 2024

11. Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models

Author: Zhao, Xinran, Zhang, Hongming, Pan, Xiaoman, Yao, Wenlin, Yu, Dong, Wu, Tongshuang, and Chen, Jianshu
Subjects: Computer Science - Computation and Language
Abstract: For a LLM to be trustworthy, its confidence level should be well-calibrated with its actual performance. While it is now common sense that LLM performances are greatly impacted by prompts, the confidence calibration in prompting LLMs has yet to be thoroughly explored. In this paper, we explore how different prompting strategies influence LLM confidence calibration and how it could be improved. We conduct extensive experiments on six prompting methods in the question-answering context and we observe that, while these methods help improve the expected LLM calibration, they also trigger LLMs to be over-confident when responding to some instances. Inspired by human cognition, we propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps. First, FaR elicits the known "facts" that are relevant to the input prompt from the LLM. And then it asks the model to "reflect" over them to generate the final answer. Experiments show that FaR prompting achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5% on our multi-purpose QA tasks. Notably, FaR prompting even elicits the capability of verbally expressing concerns in less confident scenarios, which helps trigger retrieval augmentation for solving these harder instances., Comment: 17 pages, 10 figures
Published: 2024

12. Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

Author: Kuo, Tzu-Sheng, Halfaker, Aaron, Cheng, Zirui, Kim, Jiwoo, Wu, Meng-Hsin, Wu, Tongshuang, Holstein, Kenneth, and Zhu, Haiyi
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.
Published: 2024
Full Text: View/download PDF

13. A large-scale audit of dataset licensing and attribution in AI

Author: Longpre, Shayne, Mahari, Robert, Chen, Anthony, Obeng-Marnu, Naana, Sileo, Damien, Brannon, William, Muennighoff, Niklas, Khazam, Nathan, Kabbara, Jad, Perisetla, Kartik, Wu, Xinyi (Alexis), Shippole, Enrico, Bollacker, Kurt, Wu, Tongshuang, Villa, Luis, Pentland, Sandy, and Hooker, Sara
Published: 2024
Full Text: View/download PDF

14. Measuring Adversarial Datasets

Author: Bai, Yuanchen, Huang, Raoyi, Viswanathan, Vijay, Kuo, Tzu-Sheng, and Wu, Tongshuang
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: In the era of widespread public use of AI systems across various domains, ensuring adversarial robustness has become increasingly vital to maintain safety and prevent undesirable errors. Researchers have curated various adversarial datasets (through perturbations) for capturing model deficiencies that cannot be revealed in standard benchmark datasets. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology to measure the intended and unintended consequences of those adversarial transformations. In this research, we conducted a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks, among dimensions of difficulty, diversity, and disagreement. We selected several current adversarial effect datasets and compared the distributions between the original and their adversarial counterparts. The results provide valuable insights into what makes these datasets more challenging from a metrics perspective and whether they align with underlying assumptions., Comment: ART of Safety workshop (AACL 2023)
Published: 2023

15. The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Author: Longpre, Shayne, Mahari, Robert, Chen, Anthony, Obeng-Marnu, Naana, Sileo, Damien, Brannon, William, Muennighoff, Niklas, Khazam, Nathan, Kabbara, Jad, Perisetla, Kartik, Wu, Xinyi, Shippole, Enrico, Bollacker, Kurt, Wu, Tongshuang, Villa, Luis, Pentland, Sandy, and Hooker, Sara
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org., Comment: 30 pages (18 main), 6 figures, 5 tables
Published: 2023

16. Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs

Author: Yang, Chenyang, Rustogi, Rishabh, Brower-Sinning, Rachel, Lewis, Grace A., Kästner, Christian, and Wu, Tongshuang
Subjects: Computer Science - Computation and Language, Computer Science - Software Engineering
Abstract: Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing. Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. Weaver provides rich external knowledge to testers and encourages testers to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse concepts worth testing when using Weaver. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that Weaver can help practitioners test models in real-world settings, where developers define more nuanced application scenarios (e.g., code understanding and transcript summarization) using LLMs.
Published: 2023

17. How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging

Author: Ma, Qianou, Shen, Hua, Koedinger, Kenneth, and Wu, Tongshuang
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Software Engineering
Abstract: Large Language Models (LLMs) now excel at generative skills and can create content at impeccable speeds. However, they are imperfect and still make various mistakes. In a Computer Science education context, as these models are widely recognized as "AI pair programmers," it becomes increasingly important to train students on evaluating and debugging the LLM-generated code. In this work, we introduce HypoCompass, a novel system to facilitate deliberate practice on debugging, where human novices play the role of Teaching Assistants and help LLM-powered teachable agents debug code. We enable effective task delegation between students and LLMs in this learning-by-teaching environment: students focus on hypothesizing the cause of code errors, while adjacent skills like code completion are offloaded to LLM-agents. Our evaluations demonstrate that HypoCompass generates high-quality training materials (e.g., bugs and fixes), outperforming human counterparts fourfold in efficiency, and significantly improves student performance on debugging by 12% in the pre-to-post test., Comment: 14 pages, 6 figures
Published: 2023
Full Text: View/download PDF

18. From Nuisance to News Sense: Augmenting the News with Cross-Document Evidence and Context

Author: Milbauer, Jeremiah, Ding, Ziqi, Wu, Zhijin, and Wu, Tongshuang
Subjects: Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: Reading and understanding the stories in the news is increasingly difficult. Reporting on stories evolves rapidly, politicized news venues offer different perspectives (and sometimes different facts), and misinformation is rampant. However, existing solutions merely aggregate an overwhelming amount of information from heterogenous sources, such as different news outlets, social media, and news bias rating agencies. We present NEWSSENSE, a novel sensemaking tool and reading interface designed to collect and integrate information from multiple news articles on a central topic, using a form of reference-free fact verification. NEWSSENSE augments a central, grounding article of the user's choice by linking it to related articles from different sources, providing inline highlights on how specific claims in the chosen article are either supported or contradicted by information from other articles. Using NEWSSENSE, users can seamlessly digest and cross-check multiple information sources without disturbing their natural reading flow. Our pilot study shows that NEWSSENSE has the potential to help users identify key information, verify the credibility of news articles, and explore different perspectives., Comment: EMNLP 2023 (Demo Track)
Published: 2023

19. Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from Large Language Models

Author: Liu, Michael Xieyang, Wu, Tongshuang, Chen, Tianying, Li, Franklin Mingzhe, Kittur, Aniket, and Myers, Brad A.
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: Sensemaking in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from reading an overview of an information space upfront, including the criteria others previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem -- it not only requires significant input from previous users to generate and share these overviews, but such overviews may also turn out to be biased and incomplete. In this work, we introduce a novel system, Selenite, which leverages Large Language Models (LLMs) as reasoning machines and knowledge retrievers to automatically produce a comprehensive overview of options and criteria to jumpstart users' sensemaking processes. Subsequently, Selenite also adapts as people use it, helping users find, read, and navigate unfamiliar information in a systematic yet personalized manner. Through three studies, we found that Selenite produced accurate and high-quality overviews reliably, significantly accelerated users' information processing, and effectively improved their overall comprehension and sensemaking experience., Comment: Accepted to CHI 2024
Published: 2023
Full Text: View/download PDF

20. Prompt2Model: Generating Deployable Models from Natural Language Instructions

Author: Viswanathan, Vijay, Zhao, Chenyang, Bertsch, Amanda, Wu, Tongshuang, and Neubig, Graham
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) enable system builders today to create competent NLP systems through prompting, where they only need to describe the task in natural language and provide a few examples. However, in other ways, LLMs are a step backward from traditional special-purpose NLP models; they require extensive computational resources for deployment and can be gated behind APIs. In this paper, we propose Prompt2Model, a general-purpose method that takes a natural language task description like the prompts provided to LLMs, and uses it to train a special-purpose model that is conducive to deployment. This is done through a multi-step process of retrieval of existing datasets and pretrained models, dataset generation using LLMs, and supervised fine-tuning on these retrieved and generated datasets. Over three tasks, we demonstrate that given the same few-shot prompt as input, Prompt2Model trains models that outperform the results of a strong LLM, gpt-3.5-turbo, by an average of 20% while being up to 700 times smaller. We also show that this data can be used to obtain reliable performance estimates of model performance, enabling model developers to assess model reliability before deployment. Prompt2Model is available open-source at https://github.com/neulab/prompt2model., Comment: 8 pages
Published: 2023

21. LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs

Author: Wu, Tongshuang, Zhu, Haiyi, Albayrak, Maya, Axon, Alexis, Bertsch, Amanda, Deng, Wenxing, Ding, Ziqi, Guo, Bill, Gururaja, Sireesh, Kuo, Tzu-Sheng, Liang, Jenny T., Liu, Ryan, Mandal, Ihita, Milbauer, Jeremiah, Ni, Xiaolin, Padmanabhan, Namrata, Ramkumar, Subhashini, Sudjianto, Alexis, Taylor, Jordan, Tseng, Ying-Jui, Vaidos, Patricia, Wu, Zhijin, Wu, Wei, and Yang, Chenyang
Subjects: Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these "human computation algorithms," but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks. We reflect on human and LLMs' different sensitivities to instructions, stress the importance of enabling human-facing safeguards for LLMs, and discuss the potential of training humans and LLMs with complementary skill sets. Crucially, we show that replicating crowdsourcing pipelines offers a valuable platform to investigate (1) the relative strengths of LLMs on different tasks (by cross-comparing their performances on sub-tasks) and (2) LLMs' potential in complex tasks, where they can complete part of the tasks while leaving others to humans.
Published: 2023

22. Large Language Models Enable Few-Shot Clustering

Author: Viswanathan, Vijay, Gashteovski, Kiril, Lawrence, Carolin, Wu, Tongshuang, and Neubig, Graham
Subjects: Computer Science - Computation and Language
Abstract: Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
Published: 2023

23. Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming

Author: Ma, Qianou, Wu, Tongshuang, and Koedinger, Kenneth
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: The emergence of large-language models (LLMs) that excel at code generation and commercial products such as GitHub's Copilot has sparked interest in human-AI pair programming (referred to as "pAIr programming") where an AI system collaborates with a human programmer. While traditional pair programming between humans has been extensively studied, it remains uncertain whether its findings can be applied to human-AI pair programming. We compare human-human and human-AI pair programming, exploring their similarities and differences in interaction, measures, benefits, and challenges. We find that the effectiveness of both approaches is mixed in the literature (though the measures used for pAIr programming are not as comprehensive). We summarize moderating factors on the success of human-human pair programming, which provides opportunities for pAIr programming research. For example, mismatched expertise makes pair programming less productive, therefore well-designed AI programming assistants may adapt to differences in expertise levels., Comment: 8 pages (without references), 2 tables
Published: 2023

24. Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses

Author: Stapleton, Logan, Taylor, Jordan, Fox, Sarah, Wu, Tongshuang, and Zhu, Haiyi
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: Large generative AI models (GMs) like GPT and DALL-E are trained to generate content for general, wide-ranging purposes. GM content filters are generalized to filter out content which has a risk of harm in many cases, e.g., hate speech. However, prohibited content is not always harmful -- there are instances where generating prohibited content can be beneficial. So, when GMs filter out content, they preclude beneficial use cases along with harmful ones. Which use cases are precluded reflects the values embedded in GM content filtering. Recent work on red teaming proposes methods to bypass GM content filters to generate harmful content. We coin the term green teaming to describe methods of bypassing GM content filters to design for beneficial use cases. We showcase green teaming by: 1) Using ChatGPT as a virtual patient to simulate a person experiencing suicidal ideation, for suicide support training; 2) Using Codex to intentionally generate buggy solutions to train students on debugging; and 3) Examining an Instagram page using Midjourney to generate images of anti-LGBTQ+ politicians in drag. Finally, we discuss how our use cases demonstrate green teaming as both a practical design method and a mode of critique, which problematizes and subverts current understandings of harms and values in generative AI.
Published: 2023

25. DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

Author: Viswanathan, Vijay, Gao, Luyu, Wu, Tongshuang, Liu, Pengfei, and Neubig, Graham
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language, Computer Science - Digital Libraries
Abstract: Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present a superior bi-encoder retriever for text-based dataset recommendation. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public., Comment: To appear at ACL 2023. Code published at https://github.com/viswavi/datafinder
Published: 2023

26. BiasX: 'Thinking Slow' in Toxic Content Moderation with Explanations of Implied Social Biases

Author: Zhang, Yiming, Nanduri, Sravani, Jiang, Liwei, Wu, Tongshuang, and Sap, Maarten
Subjects: Computer Science - Computation and Language
Abstract: Toxicity annotators and content moderators often default to mental shortcuts when making decisions. This can lead to subtle toxicity being missed, and seemingly toxic but harmless content being over-detected. We introduce BiasX, a framework that enhances content moderation setups with free-text explanations of statements' implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. We show that indeed, participants substantially benefit from explanations for correctly identifying subtly (non-)toxic content. The quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less compared to expert-written human explanations (+7.2%). Our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.
Published: 2023

27. ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to Support Human-AI Scientific Writing

Author: Shen, Hua, Huang, Chieh-Yang, Wu, Tongshuang, and Huang, Ting-Hao 'Kenneth'
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Despite a surge collection of XAI methods, users still struggle to obtain required AI explanations. Previous research suggests chatbots as dynamic solutions, but the effective design of conversational XAI agents for practical human needs remains under-explored. This paper focuses on Conversational XAI for AI-assisted scientific writing tasks. Drawing from human linguistic theories and formative studies, we identify four design rationales: "multifaceted", "controllability", "mix-initiative", "context-aware drill-down". We incorporate them into an interactive prototype, ConvXAI, which facilitates heterogeneous AI explanations for scientific writing through dialogue. In two studies with 21 users, ConvXAI outperforms a GUI-based baseline on improving human-perceived understanding and writing improvement. The paper further discusses the practical human usage patterns in interacting with ConvXAI for scientific co-writing., Comment: CSCW 2023 Demo. ConvXAI system code: https://github.com/huashen218/convxai.git
Published: 2023

28. Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation

Author: Fernandes, Patrick, Madaan, Aman, Liu, Emmy, Farinhas, António, Martins, Pedro Henrique, Bertsch, Amanda, de Souza, José G. C., Zhou, Shuyan, Wu, Tongshuang, Neubig, Graham, and Martins, André F. T.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Many recent advances in natural language generation have been fueled by training large language models on internet-scale data. However, this paradigm can lead to models that generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation. First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention., Comment: Work in Progress
Published: 2023

29. Tool Learning with Foundation Models

Author: Qin, Yujia, Hu, Shengding, Lin, Yankai, Chen, Weize, Ding, Ning, Cui, Ganqu, Zeng, Zheni, Huang, Yufei, Xiao, Chaojun, Han, Chi, Fung, Yi Ren, Su, Yusheng, Wang, Huadong, Qian, Cheng, Tian, Runchu, Zhu, Kunlun, Liang, Shihao, Shen, Xingyu, Xu, Bokai, Zhang, Zhen, Ye, Yining, Li, Bowen, Tang, Ziwei, Yi, Jing, Zhu, Yuzhang, Dai, Zhenning, Yan, Lan, Cong, Xin, Lu, Yaxi, Zhao, Weilin, Huang, Yuxiang, Yan, Junxi, Han, Xu, Sun, Xian, Li, Dahai, Phang, Jason, Yang, Cheng, Wu, Tongshuang, Ji, Heng, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Humans possess an extraordinary ability to create and utilize tools, allowing them to overcome physical limitations and explore new frontiers. With the advent of foundation models, AI systems have the potential to be equally adept in tool use as humans. This paradigm, i.e., tool learning with foundation models, combines the strengths of specialized tools and foundation models to achieve enhanced accuracy, efficiency, and automation in problem-solving. Despite its immense potential, there is still a lack of a comprehensive understanding of key challenges, opportunities, and future endeavors in this field. To this end, we present a systematic investigation of tool learning in this paper. We first introduce the background of tool learning, including its cognitive origins, the paradigm shift of foundation models, and the complementary roles of tools and models. Then we recapitulate existing tool learning research into tool-augmented and tool-oriented learning. We formulate a general tool learning framework: starting from understanding the user instruction, models should learn to decompose a complex task into several subtasks, dynamically adjust their plan through reasoning, and effectively conquer each sub-task by selecting appropriate tools. We also discuss how to train models for improved tool-use capabilities and facilitate the generalization in tool learning. Considering the lack of a systematic tool learning evaluation in prior works, we experiment with 18 representative tools and show the potential of current foundation models in skillfully utilizing tools. Finally, we discuss several open problems that require further investigation for tool learning. In general, we hope this paper could inspire future research in integrating tools with foundation models.
Published: 2023

30. Parachute: Evaluating Interactive Human-LM Co-writing Systems

Author: Shen, Hua and Wu, Tongshuang
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: A surge of advances in language models (LMs) has led to significant interest in using LMs to build co-writing systems, in which humans and LMs interactively contribute to a shared writing artifact. However, there is a lack of studies assessing co-writing systems in interactive settings. We propose a human-centered evaluation framework, Parachute, for interactive co-writing systems. Parachute showcases an integrative view of interaction evaluation, where each evaluation aspect consists of categorized practical metrics. Furthermore, we present Parachute with a use case to demonstrate how to evaluate and compare co-writing systems using Parachute., Comment: Accepted by CHI'23 In2Writing Workshop
Published: 2023

31. ScatterShot: Interactive In-context Example Curation for Text Transformation

Author: Wu, Tongshuang, Shen, Hua, Weld, Daniel S., Heer, Jeffrey, and Ribeiro, Marco Tulio
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Computation and Language
Abstract: The in-context learning capabilities of LLMs like GPT-3 allow annotators to customize an LLM to their specific tasks with a small number of examples. However, users tend to include only the most obvious patterns when crafting examples, resulting in underspecified in-context functions that fall short on unseen cases. Further, it is hard to know when "enough" examples have been included even for known patterns. In this work, we present ScatterShot, an interactive system for building high-quality demonstration sets for in-context learning. ScatterShot iteratively slices unlabeled data into task-specific patterns, samples informative inputs from underexplored or not-yet-saturated slices in an active learning manner, and helps users label more efficiently with the help of an LLM and the current example set. In simulation studies on two text perturbation scenarios, ScatterShot sampling improves the resulting few-shot functions by 4-5 percentage points over random sampling, with less variance as more examples are added. In a user study, ScatterShot greatly helps users in covering different patterns in the input space and labeling in-context examples more efficiently, resulting in better in-context learning and less user effort., Comment: IUI 2023: 28th International Conference on Intelligent User Interfaces
Published: 2023
Full Text: View/download PDF

32. Decisions that Explain Themselves: A User-Centric Deep Reinforcement Learning Explanation System

Author: Wu, Xiaoran, Yan, Zihan, Zhang, Chongjie, and Wu, Tongshuang
Subjects: Computer Science - Human-Computer Interaction
Abstract: With deep reinforcement learning (RL) systems like autonomous driving being wildly deployed but remaining largely opaque, developers frequently use explainable RL (XRL) tools to better understand and work with deep RL agents. However, previous XRL works employ a techno-centric research approach, ignoring how RL developers perceive the generated explanations. Through a pilot study, we identify major goals for RL practitioners to use XRL methods and four pitfalls that widen the gap between existing XRL methods and these goals. The pitfalls include inaccessible reasoning processes, inconsistent or unintelligible explanations, and explanations that cannot be generalized. To fill the discovered gap, we propose a counterfactual-inference-based explanation method that discovers the details of the reasoning process of RL agents and generates natural language explanations. Surrounding this method, we build an interactive XRL system where users can actively explore explanations and influential information. In a user study with 14 participants, we validated that developers identified 20.9% more abnormal behaviors and limitations of RL agents with our system compared to the baseline method, and using our system helped end users improve their performance in actionability tests by 25.1% in an auto-driving task and by 16.9% in a StarCraft II micromanagement task.
Published: 2022

33. Capabilities for Better ML Engineering

Author: Yang, Chenyang, Brower-Sinning, Rachel, Lewis, Grace A., Kästner, Christian, and Wu, Tongshuang
Subjects: Computer Science - Artificial Intelligence, Computer Science - Software Engineering
Abstract: In spite of machine learning's rapid growth, its engineering support is scattered in many forms, and tends to favor certain engineering stages, stakeholders, and evaluation preferences. We envision a capability-based framework, which uses fine-grained specifications for ML model behaviors to unite existing efforts towards better ML engineering. We use concrete scenarios (model design, debugging, and maintenance) to articulate capabilities' broad applications across various different dimensions, and their impact on building safer, more generalizable and more trustworthy models that reflect human needs. Through preliminary experiments, we show capabilities' potential for reflecting model generalizability, which can provide guidance for ML engineering process. We discuss challenges and opportunities for capabilities' integration into ML engineering.
Published: 2022

34. Towards Natural Language-Based Visualization Authoring

Author: Wang, Yun, Hou, Zhitao, Shen, Leixian, Wu, Tongshuang, Wang, Jiaqi, Huang, He, Zhang, Haidong, and Zhang, Dongmei
Subjects: Computer Science - Human-Computer Interaction
Abstract: A key challenge to visualization authoring is the process of getting familiar with the complex user interfaces of authoring tools. Natural Language Interface (NLI) presents promising benefits due to its learnability and usability. However, supporting NLIs for authoring tools requires expertise in natural language processing, while existing NLIs are mostly designed for visual analytic workflow. In this paper, we propose an authoring-oriented NLI pipeline by introducing a structured representation of users' visualization editing intents, called editing actions, based on a formative study and an extensive survey on visualization construction tools. The editing actions are executable, and thus decouple natural language interpretation and visualization applications as an intermediate layer. We implement a deep learning-based NL interpreter to translate NL utterances into editing actions. The interpreter is reusable and extensible across authoring tools. The authoring tools only need to map the editing actions into tool-specific operations. To illustrate the usages of the NL interpreter, we implement an Excel chart editor and a proof-of-concept authoring tool, VisTalk. We conduct a user study with VisTalk to understand the usage patterns of NL-based authoring systems. Finally, we discuss observations on how users author charts with natural language, as well as implications for future research.
Published: 2022
Full Text: View/download PDF

35. Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension

Author: Xu, Ying, Wang, Dakuo, Yu, Mo, Ritchie, Daniel, Yao, Bingsheng, Wu, Tongshuang, Zhang, Zheng, Li, Toby Jia-Jun, Bradford, Nora, Sun, Branda, Hoang, Tran Bao, Sang, Yisi, Hou, Yufang, Ma, Xiaojuan, Yang, Diyi, Peng, Nanyun, Yu, Zhou, and Warschauer, Mark
Subjects: Computer Science - Computation and Language
Abstract: Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models' fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions., Comment: Accepted to ACL 2022
Published: 2022

36. Are Shortest Rationales the Best Explanations for Human Understanding?

Author: Shen, Hua, Wu, Tongshuang, Guo, Wenbo, and Huang, Ting-Hao 'Kenneth'
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Existing self-explaining models typically favor extracting the shortest possible rationales - snippets of an input text "responsible for" corresponding output - to explain the model prediction, with the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this question, we design a self-explaining model, LimitedInk, which allows users to extract rationales at any target length. Compared to existing baselines, LimitedInk achieves compatible end-task performance and human-annotated rationale agreement, making it a suitable representation of the recent class of self-explaining models. We use LimitedInk to conduct a user study on the impact of rationale length, where we ask human judges to predict the sentiment label of documents based only on LimitedInk-generated rationales with different lengths. We show rationales that are too short do not help humans predict labels better than randomly masked text, suggesting the need for more careful design of the best human rationales., Comment: To appear in ACL 2022 main conference
Published: 2022

37. PromptChainer: Chaining Large Language Model Prompts through Visual Programming

Author: Wu, Tongshuang, Jiang, Ellen, Donsbach, Aaron, Gray, Jeff, Molina, Alejandra, Terry, Michael, and Cai, Carrie J
Subjects: Computer Science - Human-Computer Interaction
Abstract: While LLMs can effectively help prototype single ML functionalities, many real-world applications involve complex tasks that cannot be easily handled via a single run of an LLM. Recent work has found that chaining multiple LLM runs together (with the output of one step being the input to the next) can help users accomplish these more complex tasks, and in a way that is perceived to be more transparent and controllable. However, it remains unknown what users need when authoring their own LLM chains -- a key step for lowering the barriers for non-AI-experts to prototype AI-infused applications. In this work, we explore the LLM chain authoring process. We conclude from pilot studies find that chaining requires careful scaffolding for transforming intermediate node outputs, as well as debugging the chain at multiple granularities; to help with these needs, we designed PromptChainer, an interactive interface for visually programming chains. Through case studies with four people, we show that PromptChainer supports building prototypes for a range of applications, and conclude with open questions on scaling chains to complex tasks, and supporting low-fi chain prototyping., Comment: CHI LBW 2022
Published: 2022

38. StoryBuddy: A Human-AI Collaborative Chatbot for Parent-Child Interactive Storytelling with Flexible Parental Involvement

Author: Zhang, Zheng, Xu, Ying, Wang, Yanhao, Yao, Bingsheng, Ritchie, Daniel, Wu, Tongshuang, Yu, Mo, Wang, Dakuo, and Li, Toby Jia-Jun
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Despite its benefits for children's skill development and parent-child bonding, many parents do not often engage in interactive storytelling by having story-related dialogues with their child due to limited availability or challenges in coming up with appropriate questions. While recent advances made AI generation of questions from stories possible, the fully-automated approach excludes parent involvement, disregards educational goals, and underoptimizes for child engagement. Informed by need-finding interviews and participatory design (PD) results, we developed StoryBuddy, an AI-enabled system for parents to create interactive storytelling experiences. StoryBuddy's design highlighted the need for accommodating dynamic user needs between the desire for parent involvement and parent-child bonding and the goal of minimizing parent intervention when busy. The PD revealed varied assessment and educational goals of parents, which StoryBuddy addressed by supporting configuring question types and tracking child progress. A user study validated StoryBuddy's usability and suggested design insights for future parent-AI collaboration systems., Comment: Published at CHI 2022
Published: 2022
Full Text: View/download PDF

39. Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages

Author: Sun, Jiao, Wu, Tongshuang, Jiang, Yue, Awalegaonkar, Ronil, Lin, Xi Victoria, and Yang, Diyi
Subjects: Computer Science - Human-Computer Interaction
Abstract: People write personalized greeting cards on various occasions. While prior work has studied gender roles in greeting card messages, systematic analysis at scale and tools for raising the awareness of gender stereotyping remain under-investigated. To this end, we collect a large greeting card message corpus covering three different occasions (birthday, Valentine's Day and wedding) from three sources (exemplars from greeting message websites, real-life greetings from social media and language model generated ones). We uncover a wide range of gender stereotypes in this corpus via topic modeling, odds ratio and Word Embedding Association Test (WEAT). We further conduct a survey to understand people's perception of gender roles in messages from this corpus and if gender stereotyping is a concern. The results show that people want to be aware of gender roles in the messages, but remain unconcerned unless the perceived gender roles conflict with the recipient's true personality. In response, we developed GreetA, an interactive visualization and writing assistant tool to visualize fine-grained topics in greeting card messages drafted by the users and the associated gender perception scores, but without suggesting text changes as an intervention., Comment: CHI 2022
Published: 2021

40. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Author: Dhole, Kaustubh D., Gangal, Varun, Gehrmann, Sebastian, Gupta, Aadesh, Li, Zhenhao, Mahamood, Saad, Mahendiran, Abinaya, Mille, Simon, Shrivastava, Ashish, Tan, Samson, Wu, Tongshuang, Sohl-Dickstein, Jascha, Choi, Jinho D., Hovy, Eduard, Dusek, Ondrej, Ruder, Sebastian, Anand, Sajant, Aneja, Nagender, Banjade, Rabin, Barthe, Lisa, Behnke, Hanna, Berlot-Attwell, Ian, Boyle, Connor, Brun, Caroline, Cabezudo, Marco Antonio Sobrevilla, Cahyawijaya, Samuel, Chapuis, Emile, Che, Wanxiang, Choudhary, Mukund, Clauss, Christian, Colombo, Pierre, Cornell, Filip, Dagan, Gautier, Das, Mayukh, Dixit, Tanay, Dopierre, Thomas, Dray, Paul-Alexis, Dubey, Suchitra, Ekeinhor, Tatiana, Di Giovanni, Marco, Goyal, Tanya, Gupta, Rishabh, Hamla, Louanes, Han, Sang, Harel-Canada, Fabrice, Honore, Antoine, Jindal, Ishan, Joniak, Przemyslaw K., Kleyko, Denis, Kovatchev, Venelin, Krishna, Kalpesh, Kumar, Ashutosh, Langer, Stefan, Lee, Seungjae Ryan, Levinson, Corey James, Liang, Hualou, Liang, Kaizhao, Liu, Zhexiong, Lukyanenko, Andrey, Marivate, Vukosi, de Melo, Gerard, Meoni, Simon, Meyer, Maxime, Mir, Afnan, Moosavi, Nafise Sadat, Muennighoff, Niklas, Mun, Timothy Sum Hon, Murray, Kenton, Namysl, Marcin, Obedkova, Maria, Oli, Priti, Pasricha, Nivranshu, Pfister, Jan, Plant, Richard, Prabhu, Vinay, Pais, Vasile, Qin, Libo, Raji, Shahab, Rajpoot, Pawan Kumar, Raunak, Vikas, Rinberg, Roy, Roberts, Nicolas, Rodriguez, Juan Diego, Roux, Claude, S., Vasconcellos P. H., Sai, Ananya B., Schmidt, Robin M., Scialom, Thomas, Sefara, Tshephisho, Shamsi, Saqib N., Shen, Xudong, Shi, Haoyue, Shi, Yiwen, Shvets, Anna, Siegel, Nick, Sileo, Damien, Simon, Jamie, Singh, Chandan, Sitelew, Roman, Soni, Priyank, Sorensen, Taylor, Soto, William, Srivastava, Aman, Srivatsa, KV Aditya, Sun, Tony, T, Mukund Varma, Tabassum, A, Tan, Fiona Anting, Teehan, Ryan, Tiwari, Mo, Tolkiehn, Marie, Wang, Athena, Wang, Zijian, Wang, Gloria, Wang, Zijie J., Wei, Fuxuan, Wilie, Bryan, Winata, Genta Indra, Wu, Xinyi, Wydmański, Witold, Xie, Tianbao, Yaseen, Usama, Yee, Michael A., Zhang, Jing, and Zhang, Yue
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter)., Comment: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter
Published: 2021

41. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts

Author: Wu, Tongshuang, Terry, Michael, and Cai, Carrie J.
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Computation and Language
Abstract: Although large language models (LLMs) have demonstrated impressive potential on simple tasks, their breadth of scope, lack of transparency, and insufficient controllability can make them less effective when assisting humans on more complex tasks. In response, we introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step. We first define a set of LLM primitive operations useful for Chain construction, then present an interactive system where users can modify these Chains, along with their intermediate results, in a modular way. In a 20-person user study, we found that Chaining not only improved the quality of task outcomes, but also significantly enhanced system transparency, controllability, and sense of collaboration. Additionally, we saw that users developed new ways of interacting with LLMs through Chains: they leveraged sub-tasks to calibrate model expectations, compared and contrasted alternative strategies by observing parallel downstream effects, and debugged unexpected model outputs by "unit-testing" sub-components of a Chain. In two case studies, we further explore how LLM Chains may be used in future applications
Published: 2021

42. It is AI's Turn to Ask Humans a Question: Question-Answer Pair Generation for Children's Story Books

Author: Yao, Bingsheng, Wang, Dakuo, Wu, Tongshuang, Zhang, Zheng, Li, Toby Jia-Jun, Yu, Mo, and Xu, Ying
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing question answering (QA) techniques are created mainly to answer questions asked by humans. But in educational applications, teachers often need to decide what questions they should ask, in order to help students to improve their narrative understanding capabilities. We design an automated question-answer generation (QAG) system for this education scenario: given a story book at the kindergarten to eighth-grade level as input, our system can automatically generate QA pairs that are capable of testing a variety of dimensions of a student's comprehension skills. Our proposed QAG model architecture is demonstrated using a new expert-annotated FairytaleQA dataset, which has 278 child-friendly storybooks with 10,580 QA pairs. Automatic and human evaluations show that our model outperforms state-of-the-art QAG baseline systems. On top of our QAG system, we also start to build an interactive story-telling application for the future real-world deployment in this educational scenario., Comment: Accepted to ACL 2022
Published: 2021

43. DeHumor: Visual Analytics for Decomposing Humor

Author: Wang, Xingbo, Ming, Yao, Wu, Tongshuang, Zeng, Haipeng, Wang, Yong, and Qu, Huamin
Subjects: Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Multimedia, I.2.7, I.7.0, H.4.2, J.4
Abstract: Despite being a critical communication skill, grasping humor is challenging -- a successful use of humor requires a mixture of both engaging content build-up and an appropriate vocal delivery (e.g., pause). Prior studies on computational humor emphasize the textual and audio features immediately next to the punchline, yet overlooking longer-term context setup. Moreover, the theories are usually too abstract for understanding each concrete humor snippet. To fill in the gap, we develop DeHumor, a visual analytical system for analyzing humorous behaviors in public speaking. To intuitively reveal the building blocks of each concrete example, DeHumor decomposes each humorous video into multimodal features and provides inline annotations of them on the video script. In particular, to better capture the build-ups, we introduce content repetition as a complement to features introduced in theories of computational humor and visualize them in a context linking graph. To help users locate the punchlines that have the desired features to learn, we summarize the content (with keywords) and humor feature statistics on an augmented time matrix. With case studies on stand-up comedy shows and TED talks, we show that DeHumor is able to highlight various building blocks of humor examples. In addition, expert interviews with communication coaches and humor researchers demonstrate the effectiveness of DeHumor for multimodal humor analysis of speech content and vocal delivery., Comment: 15 pages. A preprint version of a publication at IEEE Transactions on Visualization and Computer Graphics (TVCG), 2021
Published: 2021
Full Text: View/download PDF

44. Tailor: Generating and Perturbing Text with Semantic Controls

Author: Ross, Alexis, Wu, Tongshuang, Peng, Hao, Peters, Matthew E., and Gardner, Matt
Subjects: Computer Science - Computation and Language
Abstract: Controlled text perturbation is useful for evaluating and improving model generalizability. However, current techniques rely on training a model for every target perturbation, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on control codes derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further composed into higher-level ones, allowing for flexible perturbation strategies. We demonstrate the effectiveness of these perturbations in multiple applications. First, we use Tailor to automatically create high-quality contrast sets for four distinct natural language processing (NLP) tasks. These contrast sets contain fewer spurious artifacts and are complementary to manually annotated ones in their lexical diversity. Second, we show that Tailor perturbations can improve model generalization through data augmentation. Perturbing just 2% of training data leads to a 5.8-point gain on an NLI challenge set measuring reliance on syntactic heuristics.
Published: 2021

45. Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models

Author: Wu, Tongshuang, Ribeiro, Marco Tulio, Heer, Jeffrey, and Weld, Daniel S.
Subjects: Computer Science - Computation and Language
Abstract: While counterfactual examples are useful for analysis and training of NLP models, current generation methods either rely on manual labor to create very few counterfactuals, or only instantiate limited types of perturbations such as paraphrases or word substitutions. We present Polyjuice, a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences. We show that Polyjuice produces diverse sets of realistic counterfactuals, which in turn are useful in various distinct applications: improving training and evaluation on three different tasks (with around 70% less annotation effort than manual generation), augmenting state-of-the-art explanation techniques, and supporting systematic counterfactual error analysis by revealing behaviors easily missed by human experts., Comment: ACL 2021, main conference, long paper
Published: 2021

46. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance

Author: Bansal, Gagan, Wu, Tongshuang, Zhou, Joyce, Fok, Raymond, Nushi, Besmira, Kamar, Ece, Ribeiro, Marco Tulio, and Weld, Daniel S.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Many researchers motivate explainable AI with studies showing that human-AI team performance on decision-making tasks improves when the AI explains its recommendations. However, prior studies observed improvements from explanations only when the AI, alone, outperformed both the human and the best team. Can explanations help lead to complementary performance, where team accuracy is higher than either the human or the AI working solo? We conduct mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in some conditions). While we observed complementary improvements from AI augmentation, they were not increased by explanations. Rather, explanations increased the chance that humans will accept the AI's recommendation, regardless of its correctness. Our result poses new challenges for human-centered AI: Can we develop explanatory approaches that encourage appropriate trust in AI, and therefore help generate (or improve) complementary performance?, Comment: CHI'21
Published: 2020

47. Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Author: Ribeiro, Marco Tulio, Wu, Tongshuang, Guestrin, Carlos, and Singh, Sameer
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Published: 2020

48. Technology-Enabled Disinformation: Summary, Lessons, and Recommendations

Author: Akers, John, Bansal, Gagan, Cadamuro, Gabriel, Chen, Christine, Chen, Quanze, Lin, Lucy, Mulcaire, Phoebe, Nandakumar, Rajalakshmi, Rockett, Matthew, Simko, Lucy, Toman, John, Wu, Tongshuang, Zeng, Eric, Zorn, Bill, and Roesner, Franziska
Subjects: Computer Science - Computers and Society
Abstract: Technology is increasingly used -- unintentionally (misinformation) or intentionally (disinformation) -- to spread false information at scale, with potentially broad-reaching societal effects. For example, technology enables increasingly realistic false images and videos, and hyper-personal targeting means different people may see different versions of reality. This report is the culmination of a PhD-level special topics course (https://courses.cs.washington.edu/courses/cse599b/18au/) in Computer Science & Engineering at the University of Washington's Paul G. Allen School in the fall of 2018. The goals of this course were to study (1) how technologies and today's technical platforms enable and support the creation and spread of such mis- and disinformation, as well as (2) how technical approaches could be used to mitigate these issues. In this report, we summarize the space of technology-enabled mis- and disinformation based on our investigations, and then surface our lessons and recommendations for technologists, researchers, platform designers, policymakers, and users.
Published: 2018

49. Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from Large Language Models

Author: Liu, Michael Xieyang, primary, Wu, Tongshuang, additional, Chen, Tianying, additional, Li, Franklin Mingzhe, additional, Kittur, Aniket, additional, and Myers, Brad A, additional
Published: 2024
Full Text: View/download PDF

50. Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

Author: Kuo, Tzu-Sheng, primary, Halfaker, Aaron Lee, additional, Cheng, Zirui, additional, Kim, Jiwoo, additional, Wu, Meng-Hsin, additional, Wu, Tongshuang, additional, Holstein, Kenneth, additional, and Zhu, Haiyi, additional
Published: 2024
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

172 results on '"Wu, Tongshuang"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources