Author: "Fried, Daniel" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Fried, Daniel"' showing total 1,580 results

1. Improving Model Factuality with Fine-grained Critique-based Evaluator

Author: Xie, Yiqing, Zhou, Wenxuan, Prakash, Pradyot, Jin, Di, Mao, Yuning, Fettes, Quintin, Talebzadeh, Arya, Wang, Sinong, Fang, Han, Rose, Carolyn, Fried, Daniel, and Zhang, Hejia
Subjects: Computer Science - Computation and Language
Abstract: Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
Published: 2024

2. Human-aligned Chess with a Bit of Search

Author: Zhang, Yiming, Jacob, Athul Paul, Lai, Vivian, Fried, Daniel, and Ippolito, Daphne
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Chess has long been a testbed for AI's quest to match human intelligence, and in recent years, chess AI systems have surpassed the strongest humans at the game. However, these systems are not human-aligned; they are unable to match the skill levels of all human partners or model human-like behaviors beyond piece movement. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game. Allie is trained on log sequences of real chess games to model the behaviors of human chess players across the skill spectrum, including non-move behaviors such as pondering times and resignations. In offline evaluations, we find that Allie exhibits humanlike behavior: it outperforms the existing state-of-the-art in human chess move prediction and "ponders" at critical positions. The model learns to reliably assign reward at each game state, which can be used at inference as a reward function in a novel time-adaptive Monte-Carlo tree search (MCTS) procedure, where the amount of search depends on how long humans would think in the same positions. Adaptive search enables remarkable skill calibration; in a large-scale online evaluation against players with ratings from 1000 to 2600 Elo, our adaptive search method leads to a skill gap of only 49 Elo on average, substantially outperforming search-free and standard MCTS baselines. Against grandmaster-level (2500 Elo) opponents, Allie with adaptive search exhibits the strength of a fellow grandmaster, all while learning exclusively from humans.
Published: 2024

3. CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

Author: Naik, Atharva, Alenius, Marcus, Fried, Daniel, and Rose, Carolyn
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff), even though code review is a one-to-many problem like generation and summarization with many "valid reviews" for a diff. To tackle these issues, we develop CRScore, a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
Published: 2024

4. Agent Workflow Memory

Author: Wang, Zora Zhiruo, Mao, Jiayuan, Fried, Daniel, and Neubig, Graham
Subjects: Computer Science - Computation and Language
Abstract: Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
Published: 2024

5. ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?

Author: Waghjale, Siddhant, Veerendranath, Vishruth, Wang, Zora Zhiruo, and Fried, Daniel
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Further, unreliability in benchmarking code efficiency is a hurdle across varying hardware specifications for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code editing. On ECCO, we adapt and thoroughly investigate the three most promising existing LLM-based approaches: in-context learning, iterative refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade functional correctness and moderately increase program efficiency, we find that adding execution information often helps maintain functional correctness, and NL feedback enhances more on efficiency. We release our benchmark to support future work on LLM-based generation of efficient code., Comment: EMNLP 2024; Project Page: https://ecco-code-eff.github.io/
Published: 2024

6. Tree Search for Language Model Agents

Author: Koh, Jing Yu, McAleer, Stephen, Fried, Daniel, and Salakhutdinov, Ruslan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents., Comment: 12 pages. Models and code available at https://jykoh.com/search-agents
Published: 2024

7. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Author: Zhuo, Terry Yue, Vu, Minh Chien, Chim, Jenny, Hu, Han, Yu, Wenhao, Widyasari, Ratnadira, Yusuf, Imam Nur Bani, Zhan, Haolan, He, Junda, Paul, Indraneil, Brunner, Simon, Gong, Chen, Hoang, Thong, Zebaze, Armel Randy, Hong, Xiaoheng, Li, Wen-Ding, Kaddour, Jean, Xu, Ming, Zhang, Zhihan, Yadav, Prateek, Jain, Naman, Gu, Alex, Cheng, Zhoujun, Liu, Jiawei, Liu, Qian, Wang, Zijian, Lo, David, Hui, Binyuan, Muennighoff, Niklas, Fried, Daniel, Du, Xiaoning, de Vries, Harm, and Von Werra, Leandro
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area., Comment: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)
Published: 2024

8. CodeRAG-Bench: Can Retrieval Augment Code Generation?

Author: Wang, Zora Zhiruo, Asai, Akari, Yu, Xinyan Velocity, Xu, Frank F., Xie, Yiqing, Neubig, Graham, and Fried, Daniel
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
Published: 2024

9. Adversarial Attacks on Multimodal Agents

Author: Wu, Chen Henry, Koh, Jing Yu, Salakhutdinov, Ruslan, Fried, Daniel, and Raghunathan, Aditi
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of 16/256 on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack, Comment: 19 pages
Published: 2024

10. Amortizing Pragmatic Program Synthesis with Rankings

Author: Pu, Yewen, Vaduguru, Saujas, Vaithilingam, Priyan, Glassman, Elena, and Fried, Daniel
Subjects: Computer Science - Programming Languages, Computer Science - Artificial Intelligence
Abstract: The Rational Speech Acts (RSA) framework has been successful in building pragmatic program synthesizers that return programs which, in addition to being logically consistent with user-generated examples, account for the fact that a user chooses their examples informatively. We present a general method of amortizing the slow, exact RSA synthesizer. Our method first queries the exact RSA synthesizer to compile a communication dataset. The dataset contains a number of example-dependent rankings of subsets of programs. It then distills a single global ranking of all programs as an approximation to every ranking in the dataset. This global ranking is then used at inference time to rank multiple logically consistent candidate programs generated from a fast, non-pragmatic synthesizer. Experiments on two program synthesis domains using our ranking method resulted in orders of magnitude of speedup compared to the exact RSA synthesizer, while being more accurate than a non-pragmatic synthesizer when communicating with humans. Finally, we prove that in the special case of synthesis from a single example, this approximation is exact., Comment: ICML 2024. This work supersedes and serves as a new version of arXiv:2309.03225
Published: 2024

11. Evaluating Large Language Model Biases in Persona-Steered Generation

Author: Liu, Andy, Diab, Mona, and Fried, Daniel
Subjects: Computer Science - Computation and Language
Abstract: The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints., Comment: Accepted to Findings of ACL 2024. Code and data available at https://github.com/andyjliu/persona-steered-generation-bias
Published: 2024

12. Human-Agent Cooperation in Games under Incomplete Information through Natural Language Communication

Author: Chen, Shenghui, Fried, Daniel, and Topcu, Ufuk
Subjects: Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction
Abstract: Developing autonomous agents that can strategize and cooperate with humans under information asymmetry is challenging without effective communication in natural language. We introduce a shared-control game, where two players collectively control a token in alternating turns to achieve a common objective under incomplete information. We formulate a policy synthesis problem for an autonomous agent in this game with a human as the other player. To solve this problem, we propose a communication-based approach comprising a language module and a planning module. The language module translates natural language messages into and from a finite set of flags, a compact representation defined to capture player intents. The planning module leverages these flags to compute a policy using an asymmetric information-set Monte Carlo tree search with flag exchange algorithm we present. We evaluate the effectiveness of this approach in a testbed based on Gnomes at Night, a search-and-find maze board game. Results of human subject experiments show that communication narrows the information gap between players and enhances human-agent cooperation efficiency with fewer turns., Comment: with appendix
Published: 2024

13. Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Non-Literal Intent Resolution in LLMs

Author: Yerukola, Akhila, Vaduguru, Saujas, Fried, Daniel, and Sap, Maarten
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Humans often express their communicative intents indirectly or non-literally, which requires their interlocutors -- human or AI -- to understand beyond the literal meaning of words. While most existing work has focused on discriminative evaluations, we present a new approach to generatively evaluate large language models' (LLMs') intention understanding by examining their responses to non-literal utterances. Ideally, an LLM should respond in line with the true intention of a non-literal utterance, not its literal interpretation. Our findings show that LLMs struggle to generate pragmatically relevant responses to non-literal language, achieving only 50-55% accuracy on average. While explicitly providing oracle intentions significantly improves performance (e.g., 75% for Mistral-Instruct), this still indicates challenges in leveraging given intentions to produce appropriate responses. Using chain-of-thought to make models spell out intentions yields much smaller gains (60% for Mistral-Instruct). These findings suggest that LLMs are not yet effective pragmatic interlocutors, highlighting the need for better approaches for modeling intentions and utilizing them for pragmatic generation.
Published: 2024

14. Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

Author: Kennington, Casey, Alikhani, Malihe, Pon-Barry, Heather, Atwell, Katherine, Bisk, Yonatan, Fried, Daniel, Gervits, Felix, Han, Zhao, Inan, Mert, Johnston, Michael, Korpan, Raj, Litman, Diane, Marge, Matthew, Matuszek, Cynthia, Mead, Ross, Mohan, Shiwali, Mooney, Raymond, Parde, Natalie, Sinapov, Jivko, Stewart, Angela, Stone, Matthew, Tellex, Stefanie, and Williams, Tom
Subjects: Computer Science - Computation and Language, Computer Science - Robotics
Abstract: The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first focused on education, the second on benchmarks, and the third on the modeling of language when it comes to spoken interaction with robots. The three proposals should act as white papers for any researcher to take and build upon., Comment: NSF Report on the "Dialogue with Robots" Workshop held in Pittsburgh, PA, April 2023
Published: 2024

15. CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Author: Xie, Yiqing, Xie, Alex, Sheth, Divyanshu, Liu, Pengfei, Fried, Daniel, and Rose, Carolyn
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.
Published: 2024

16. What Are Tools Anyway? A Survey from the Language Model Perspective

Author: Wang, Zhiruo, Cheng, Zhoujun, Zhu, Hao, Fried, Daniel, and Neubig, Graham
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Language models (LMs) are powerful yet mostly for text generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. However, many works adopt the term "tool" in different ways, raising the question: What is a tool anyway? Subsequently, where and how do tools help LMs? In this survey, we provide a unified definition of tools as external programs used by LMs, and perform a systematic review of LM tooling scenarios and approaches. Grounded on this review, we empirically study the efficiency of various tooling methods by measuring their required compute and performance gains on various benchmarks, and highlight some challenges and potential future research in the field.
Published: 2024

17. Repetition Improves Language Model Embeddings

Author: Springer, Jacob Mitchell, Kotha, Suhas, Fried, Daniel, Neubig, Graham, and Raghunathan, Aditi
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data., Comment: 36 pages, 11 figures, 16 tables
Published: 2024

18. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Author: Koh, Jing Yu, Lo, Robert, Jang, Lawrence, Duvvur, Vikram, Lim, Ming Chong, Huang, Po-Yu, Neubig, Graham, Zhou, Shuyan, Salakhutdinov, Ruslan, and Fried, Daniel
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa., Comment: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa
Published: 2024

19. TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks

Author: Wang, Zhiruo, Fried, Daniel, and Neubig, Graham
Subjects: Computer Science - Artificial Intelligence
Abstract: Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using primitive functions often leads to verbose and error-prone programs, and higher-level functions require expert design. To enable better solutions without human labor, we ask code LMs to curate reusable high-level functions, and use them to write solutions. We present TROVE, a training-free method of inducing a verifiable and efficient toolbox of functions, by generating via using, growing, and periodically trimming the toolbox. On 11 datasets from math, table question answering, and image reasoning tasks, TROVE consistently yields simpler solutions with higher accuracy than baselines using CODELLAMA and previous methods using GPT, while using 79-98% smaller toolboxes. TROVE further enables 31% faster and 13% more accurate human verification than baselines. With the same pipeline, it creates diverse functions for varied tasks and datasets, providing insights into their individual characteristics.
Published: 2024

20. Asking More Informative Questions for Grounded Retrieval

Author: Keh, Sedrick, Chiu, Justin T., and Fried, Daniel
Subjects: Computer Science - Computation and Language
Abstract: When a model is trying to gather information in an interactive setting, it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous studies have been constrained to polar yes/no questions, limiting how much information the model can gain in a single turn. We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf visual question answering (VQA) models often make presupposition errors, which standard information gain question selection methods fail to account for. To address this issue, we propose a method that can incorporate presupposition handling into both question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations.
Published: 2023

21. Generating Pragmatic Examples to Train Neural Program Synthesizers

Author: Vaduguru, Saujas, Fried, Daniel, and Pu, Yewen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Programming Languages
Abstract: Programming-by-example is the task of synthesizing a program that is consistent with a set of user-provided input-output examples. As examples are often an under-specification of one's intent, a good synthesizer must choose the intended program from the many that are consistent with the given set of examples. Prior work frames program synthesis as a cooperative game between a listener (that synthesizes programs) and a speaker (a user choosing examples), and shows that models of computational pragmatic inference are effective in choosing the user intended programs. However, these models require counterfactual reasoning over a large set of programs and examples, which is infeasible in realistic program spaces. In this paper, we propose a novel way to amortize this search with neural networks. We sample pairs of programs and examples via self-play between listener and speaker models, and use pragmatic inference to choose informative training examples from this sample. We then use the informative dataset to train models to improve the synthesizer's ability to disambiguate user-provided examples without human supervision. We validate our method on the challenging task of synthesizing regular expressions from example strings, and find that our method (1) outperforms models trained without choosing pragmatic examples by 23% (a 51% relative increase), and (2) matches the performance of supervised learning on a dataset of pragmatic examples provided by humans, despite using no human data in training.
Published: 2023

22. Comparative Knowledge Distillation

Author: Wilf, Alex, Xu, Alex Tianyi, Liang, Paul Pu, Obolenskiy, Alexander, Fried, Daniel, and Morency, Louis-Philippe
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In the era of large scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state of the art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state of the art data augmentation and KD techniques., Comment: arXiv admin note: text overlap with arXiv:2310.13011
Published: 2023

23. Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Author: Xie, Yiqing, Naik, Atharva, Fried, Daniel, and Rose, Carolyn
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Software Engineering
Abstract: One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans., Comment: EMNLP 2023 Findings (with minor updates on the flowcharts)
Published: 2023

24. Symbolic Planning and Code Generation for Grounded Dialogue

Author: Chiu, Justin T., Zhao, Wenting, Chen, Derek, Vaduguru, Saujas, Rush, Alexander M., and Fried, Daniel
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) excel at processing and generating both text and code. However, LLMs have had limited applicability in grounded task-oriented dialogue as they are difficult to steer toward task objectives and fail to handle novel grounding. We present a modular and interpretable grounded dialogue system that addresses these shortcomings by composing LLMs with a symbolic planner and grounded code execution. Our system consists of a reader and planner: the reader leverages an LLM to convert partner utterances into executable code, calling functions that perform grounding. The translated code's output is stored to track dialogue state, while a symbolic planner determines the next appropriate response. We evaluate our system's performance on the demanding OneCommon dialogue task, involving collaborative reference resolution on abstract images of scattered dots. Our system substantially outperforms the previous state-of-the-art, including improving task success in human evaluations from 56% to 69% in the most challenging setting., Comment: Accepted to EMNLP 2023
Published: 2023

25. API-Assisted Code Generation for Question Answering on Varied Table Structures

Author: Cao, Yihan, Chen, Shuyi, Liu, Ryan, Wang, Zhiruo, and Fried, Daniel
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: A persistent challenge to table question answering (TableQA) by generating executable programs has been adapting to varied table structures, typically requiring domain-specific logical forms. In response, this paper introduces a unified TableQA framework that: (1) provides a unified representation for structured tables as multi-index Pandas data frames, (2) uses Python as a powerful querying language, and (3) uses few-shot prompting to translate NL questions into Python programs, which are executable on Pandas data frames. Furthermore, to answer complex relational questions with extended program functionality and external knowledge, our framework allows customized APIs that Python programs can call. We experiment with four TableQA datasets that involve tables of different structures -- relational, multi-table, and hierarchical matrix shapes -- and achieve prominent improvements over past state-of-the-art systems. In ablation studies, we (1) show benefits from our multi-index representation and APIs over baselines that use only an LLM, and (2) demonstrate that our approach is modular and can incorporate additional APIs., Comment: EMNLP 2023 camera ready, 13 pages, 11 figures
Published: 2023

26. SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Author: Zhou, Xuhui, Zhu, Hao, Mathur, Leena, Zhang, Ruohong, Yu, Haofei, Qi, Zhengyang, Morency, Louis-Philippe, Bisk, Yonatan, Fried, Daniel, Neubig, Graham, and Sap, Maarten
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents., Comment: Preprint, 43 pages. The first two authors contribute equally
Published: 2023

27. Amortizing Pragmatic Program Synthesis with Rankings

Author: Pu, Yewen, Vaduguru, Saujas, Vaithilingam, Priyan, Glassman, Elena, and Fried, Daniel
Subjects: Computer Science - Programming Languages, Computer Science - Artificial Intelligence, I.2.2, D.3.0
Abstract: In program synthesis, an intelligent system takes in a set of user-generated examples and returns a program that is logically consistent with these examples. The use of the Rational Speech Acts (RSA) framework has been successful in building pragmatic program synthesizers that return programs which -- in addition to being logically consistent -- account for the fact that a user chooses their examples informatively. However, the computational burden of running the RSA algorithm has restricted the application of pragmatic program synthesis to domains with a small number of possible programs. This work presents a novel method of amortizing the RSA algorithm by leveraging a global pragmatic ranking -- a single, total ordering of all the hypotheses. We prove that for a pragmatic synthesizer that uses a single demonstration, our global ranking method exactly replicates RSA's ranked responses. We further empirically show that global rankings effectively approximate the full pragmatic synthesizer in an online, multi-demonstration setting. Experiments on two program synthesis domains using our pragmatic ranking method resulted in orders of magnitude of speedup compared to the RSA synthesizer, while outperforming the standard, non-pragmatic synthesizer., Comment: I accidentally submitted a new version of this (arXiv:2407.02499) instead of replacing this one, so I'll take this one out as it is out-dated
Published: 2023

28. Clinical SWIR and CP-OCT imaging of interproximal lesions

Author: Zhu, Yihua, Le, Oanh, Xue, Joany, Wycoff, Spencer, and Fried, Daniel
Subjects: Biomedical and Clinical Sciences, Dentistry, Biomedical Imaging, Cancer, Bioengineering, Clinical Research, 4.2 Evaluation of markers and technologies, 4.1 Discovery and preclinical testing of markers and technologies, Humans, Tomography, Optical Coherence, Middle Aged, Aged, Adolescent, Aged, 80 and over, Adult, Dental Caries, Young Adult, Transillumination, Infrared Rays, Female, Male, Dental Enamel, Sensitivity and Specificity, Image Processing, Computer-Assisted, Dental caries, Caries detection, Optical coherence tomography, SWIR imaging, Interproximal lesions
Abstract: Background: Enamel is highly transparent at short wavelength infrared imaging (SWIR) wavelengths allowing the detection of dental decay without the need for ionizing radiation. The purpose of this study was to use SWIR imaging methods including cross polarization optical coherence tomography (CP-OCT), occlusal transillumination (SWIR-OT), proximal transillumination (SWIR-PT), and occlusal reflectance (SWIR-R) to image interproximal lesions in vivo and compare the sensitivity with radiography. Methods: Participants (n = 30) aged 18-80 each with a radiopositive interproximal lesion scheduled for restoration were enrolled in the study. Studies have shown that the opposing proximal surfaces across the contact will likely also have lesions. SWIR images were acquired of the adjoining teeth at each contact with an interproximal lesion scheduled for restoration. Lesion presence and depth were assessed on each side of the contact for radiography and each SWIR imaging method. Lesions on radiographs and in CP-OCT images were identified by a single examiner while lesions in SWIR images were identified by a contrast threshold via semi-automatic image segmentation. Results: All SWIR imaging methods had significantly higher sensitivity (P
Published: 2024

29. Short-Wavelength Infrared Imaging of Infected and Affected Dentin

Author: Ng, Morgan, Ho, Yi-Ching, Wycoff, Spencer, Zhu, Yihua, and Fried, Daniel
Subjects: Biomedical and Clinical Sciences, Dentistry, SWIR imaging, caries detection, affected dentin, infected dentin, Clinical sciences
Abstract: Stains produced by bacteria or those found in blood and food byproducts accumulate in highly porous caries lesions. They can interfere with accurate diagnosis and the selective removal of carious tissue during cavity preparations. Short-wavelength infrared (SWIR) imaging studies have shown that stain molecules do not absorb light beyond 1200 nm. The objective of this study was to image affected and infected dentin at SWIR wavelengths. Sections of 3 mm thickness were cut from the extracted teeth with deep dentinal lesions. The sound (normal), affected (stained), and infected (demineralized) dentin on each section were examined with reflected light at wavelengths from 400 to 1700 nm, red and green fluorescence, and with optical coherence tomography (OCT). Microcomputed tomography (microCT) was used to measure the mineral density at each location investigated. Significant (p < 0.05) differences were observed in the reflected light intensity at 400-850 nm and for fluorescence between the sound, affected, and infected dentin. SWIR imaging did not show significant reductions in reflectivity for the affected and infected dentin. SWIR images may be valuable for monitoring the lateral spread of dentinal lesions on the occlusal surfaces of teeth.
Published: 2024

30. Neither Here nor There: The (Non-)Geographical Futures of Comparative Literature

Author: Fried, Daniel and Hui, Zhang
Published: 2014

31. WebArena: A Realistic Web Environment for Building Autonomous Agents

Author: Zhou, Shuyan, Xu, Frank F., Zhu, Hao, Zhou, Xuhui, Lo, Robert, Sridhar, Abishek, Cheng, Xianyi, Ou, Tianyue, Bisk, Yonatan, Fried, Daniel, Alon, Uri, and Neubig, Graham
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress., Comment: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/
Published: 2023

32. Pragmatic Inference with a CLIP Listener for Contrastive Captioning

Author: Ou, Jiefu, Krojer, Benno, and Fried, Daniel
Subjects: Computer Science - Computation and Language
Abstract: We propose a simple yet effective and robust method for contrastive captioning: generating discriminative captions that distinguish target images from very similar alternative distractor images. Our approach is built on a pragmatic inference procedure that formulates captioning as a reference game between a speaker, which produces possible captions describing the target, and a listener, which selects the target given the caption. Unlike previous methods that derive both speaker and listener distributions from a single captioning model, we leverage an off-the-shelf CLIP model to parameterize the listener. Compared with captioner-only pragmatic models, our method benefits from rich vision language alignment representations from CLIP when reasoning over distractors. Like previous methods for discriminative captioning, our method uses a hyperparameter to control the tradeoff between the informativity (how likely captions are to allow a human listener to discriminate the target image) and the fluency of the captions. However, we find that our method is substantially more robust to the value of this hyperparameter than past methods, which allows us to automatically optimize the captions for informativity - outperforming past methods for discriminative captioning by 11% to 15% accuracy in human evaluations, Comment: Findings of ACL 2023, fixed some references
Published: 2023

33. Time‐resolved SWIR imaging for the assessment of the activity of occlusal caries lesions

Author: Ng, Morgan, Wycoff, Spencer, Zhu, Yihua, Ho, Yi‐Ching, Takasuka, Hannah, and Fried, Daniel
Subjects: Analytical Chemistry, Medicinal and Biomolecular Chemistry, Chemical Sciences, Dental/Oral and Craniofacial Disease, 4.2 Evaluation of markers and technologies, Humans, Dehydration, Dental Caries Susceptibility, X-Ray Microtomography, Tomography, Optical Coherence, Kinetics, Dental Caries, dental caries, lesion activity, occlusal lesions, SWIR imaging, Optical Physics, Medical Biotechnology, Optoelectronics & Photonics, Analytical chemistry, Medicinal and biomolecular chemistry
Abstract: The aim of this study was to develop a clinical SWIR reflectance handpiece to assess the activity of lesions on the occlusal surfaces. The time-resolved reflectivity of 10 active and 10 arrested occlusal caries lesions on extracted teeth was monitored at 1470 nm using a benchtop system and a modified clinical prototype during forced air drying. The presence of a highly mineralized surface layer measured with microcomputed tomography (microCT) was used to indicate lesion activity. Multiple kinetic parameters were extracted from the acquired SWIR time versus intensity dehydration curves and used to assess lesion activity. Three parameters: delay, %Ifin, and rate calculated from the SWIR dehydration curves were significantly different (p
Published: 2023

34. Assessment of the activity of secondary caries lesions with short-wavelength infrared, thermal, and optical coherence tomographic imaging

Author: Chang, Nai-Yuan N, Dillas, Tina, Zhu, Yihua, and Fried, Daniel
Subjects: Biomedical and Clinical Sciences, Dentistry, Dental/Oral and Craniofacial Disease, Bioengineering, Biomedical Imaging, Infectious Diseases, 4.2 Evaluation of markers and technologies, Detection, screening and diagnosis, Humans, Tomography, Optical Coherence, Dehydration, Dental Caries Susceptibility, X-Ray Microtomography, Dental Caries, lesion activity, micro-computed tomography, optical coherence tomography, secondary caries lesions, shortwave-infrared imaging, thermal imaging, Optical Physics, Biomedical Engineering, Opthalmology and Optometry, Optics, Ophthalmology and optometry, Biomedical engineering, Atomic, molecular and optical physics
Abstract: Significance: Leakage in the interfaces between restorative materials and tooth structure allows for fluid and bacterial acid infiltration, causing restoration failure due to secondary caries. Dentists spend more time replacing composite restorations than placing new ones. Previous in vitro and in vivo studies on enamel and root surfaces using shortwave-infrared (SWIR) and thermal imaging during dehydration with forced air have been promising for assessing lesion activity. Aim: We hypothesized that SWIR reflectance and thermal imaging methods can be used to monitor the activity of secondary caries lesions around composite restorations. The objective of this study was to employ these methods to measure the rate of fluid loss from lesions during dehydration with forced air to assess lesion activity. Approach: Sixty-three extracted human teeth with a total of 109 suspected secondary lesions were examined using SWIR and thermal imaging during dehydration. The thickness of the highly mineralized transparent surface layer (TSL) at lesion interfaces indicative of lesion activity was measured by optical coherence tomography (OCT). Micro-computed tomography (MicroCT) was used to further confirm lesion severity and structure. OCT and MicroCT measurements of lesion structure, depth, and severity were correlated with fluid loss rates measured with SWIR reflectance and thermal imaging. Results: TSL thickness measured with OCT correlated with both SWIR reflectance and thermal measurements of rates of fluid loss (p < 0.05). Increasing TSL thickness led to decreased permeability of lesions, potentially indicating full lesion arrest at TSL ≥ 70 μm. SWIR performed better than thermal imaging for secondary lesion activity assessment, although both methods performed best on smooth surface lesions. Conclusions: Nondestructive SWIR reflectance and OCT imaging methods are promising for clinically monitoring the activity of secondary caries lesions.
Published: 2023

35. In vitro Assessment of lesion activity using simultaneous time-resolved reflectance imaging at 1300 and 1950 nm

Author: Wycoff, Spencer, Zhu, Yihua, and Fried, Daniel
Published: 2024

36. Generating Images with Multimodal Language Models

Author: Koh, Jing Yu, Fried, Daniel, and Salakhutdinov, Ruslan
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence., Comment: NeurIPS 2023. Project page: http://jykoh.com/gill
Published: 2023

37. StarCoder: may the source be with you!

Author: Li, Raymond, Allal, Loubna Ben, Zi, Yangtian, Muennighoff, Niklas, Kocetkov, Denis, Mou, Chenghao, Marone, Marc, Akiki, Christopher, Li, Jia, Chim, Jenny, Liu, Qian, Zheltonozhskii, Evgenii, Zhuo, Terry Yue, Wang, Thomas, Dehaene, Olivier, Davaadorj, Mishig, Lamy-Poirier, Joel, Monteiro, João, Shliazhko, Oleh, Gontier, Nicolas, Meade, Nicholas, Zebaze, Armel, Yee, Ming-Ho, Umapathi, Logesh Kumar, Zhu, Jian, Lipkin, Benjamin, Oblokulov, Muhtasham, Wang, Zhiruo, Murthy, Rudra, Stillerman, Jason, Patel, Siva Sankalp, Abulkhanov, Dmitry, Zocca, Marco, Dey, Manan, Zhang, Zhihan, Fahmy, Nour, Bhattacharyya, Urvashi, Yu, Wenhao, Singh, Swayam, Luccioni, Sasha, Villegas, Paulo, Kunakov, Maxim, Zhdanov, Fedor, Romero, Manuel, Lee, Tony, Timor, Nadav, Ding, Jennifer, Schlesinger, Claire, Schoelkopf, Hailey, Ebert, Jan, Dao, Tri, Mishra, Mayank, Gu, Alex, Robinson, Jennifer, Anderson, Carolyn Jane, Dolan-Gavitt, Brendan, Contractor, Danish, Reddy, Siva, Fried, Daniel, Bahdanau, Dzmitry, Jernite, Yacine, Ferrandis, Carlos Muñoz, Hughes, Sean, Wolf, Thomas, Guha, Arjun, von Werra, Leandro, and de Vries, Harm
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Programming Languages, Computer Science - Software Engineering
Abstract: The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
Published: 2023

38. Beijing's Crypto-Victorian: Traditionalist Influences on Hu Shi's Poetic Practice

Author: Fried, Daniel
Published: 2007

39. The Politics of the Coleridgean Symbol

Author: Fried, Daniel
Published: 2006

40. Exploratory Analysis of Objective Outcome Measures for the Clinical Assessment of Erosive Tooth Wear.

Author: Romero, Maria, Ungar, Peter, Lippert, Frank, Zero, Domenick, Zunt, Susan, Eckert, George, Gossweiler, Ana, Elkington-Stauss, Dylan, Tamayo-Cabeza, Guillermo, Kelly, Adam, Bartels, Troy, Kita, Camille, Wewers, Elizabeth, Hara, Anderson, and Fried, Daniel
Subjects: BEWE, dental enamel, enamel surface texture, erosive tooth wear, optical coherence tomography
Abstract: This study proposed using enamel surface texture and thickness for the objective detection and monitoring of erosive tooth wear (ETW), comparing them to the standard subjective Basic Erosive Wear Evaluation (BEWE). Thirty-two subjects (n = 597 teeth) were enrolled in this longitudinal observational clinical study. Enamel thickness (by cross-polarization optical coherence tomography, CP-OCT) and 3D dental microwear parameters, i.e., area-scale fractal complexity (Asfc), anisotropy (Str), and roughness (Sa) (by white-light scanning confocal profilometry), were obtained from buccal surfaces. Buccal, occlusal, and lingual surfaces were scored for BEWE and the maximum score per tooth (BEWEMax) was determined at baseline and 12 months (M12). Data outcome relationships were evaluated (alpha = 0.05). Enamel thickness decreased (p < 0.001), BEWE scores, Sa, and Str increased (p < 0.001), while Asfc did not change at M12. Baseline BEWEBuccal correlated strongly with BEWEMax (r = 0.86, p < 0.001) and moderately with BEWELingual (r = 0.42, p < 0.001), but not with enamel thickness (r = 0.03, p = 0.43). Change (Δ) in surface texture outcomes correlated poorly but significantly with ΔBEWEBuccal (r = -0.15-0.16, p < 0.001) and did not correlate with Δenamel thickness (r = 0.02-0.09, p > 0.06). Teeth with BEWE progression revealed a greater increase in ΔSa and ΔStr. These findings suggest that enamel surface roughness can potentially determine ETW severity, and CP-OCT may be relevant for clinically monitoring enamel thickness.
Published: 2023

41. Monitoring lesion activity on primary teeth with CP‐OCT and SWIR reflectance imaging

Author: Zhu, Yihua, Kim, Jungsoo, Lin, Brent, and Fried, Daniel
Subjects: Biomedical and Clinical Sciences, Dentistry, Dental/Oral and Craniofacial Disease, Biomedical Imaging, Clinical Research, Humans, Fluorides, Tomography, Optical Coherence, Tooth Demineralization, Tooth, Deciduous, caries detection, caries diagnosis, dental caries, lesion activity, optical coherence tomography, SWIR imaging, Clinical Sciences, Dermatology & Venereal Diseases, Clinical sciences
Abstract: Objective: The purpose of this study was to use cross polarization optical coherence tomography (CP-OCT) and short wavelength infrared imaging (SWIR) reflectance imaging to monitor changes in the structure and activity of early occlusal caries on primary teeth over a period of 6 months during intervention with fluoride. Methods: Participants (n = 29) aged 6-10 each with two suspected active occlusal lesions on primary teeth completed the study. Fluoride varnish was applied to tooth surfaces every 3 months and participants were instructed to brush twice daily with a fluoride toothpaste. Images were acquired using CP-OCT every 3 months for 6 months. SWIR reflectance images were acquired during forced air-drying of the lesions for 30 s at 0 and 6 months. Results: Most of the 42 lesions appeared initially active at baseline. Only 6 lesions appeared arrested at baseline based on the presence of a highly mineralized transparent surface layer (TSL) in CP-OCT images. At 6 months, 14 of the lesions appeared arrested including the 6 initially arrested lesions and the TSL thickness increased significantly (p
Published: 2023

42. Grounding Language Models to Images for Multimodal Inputs and Outputs

Author: Koh, Jing Yu, Salakhutdinov, Ruslan, and Fried, Daniel
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
Comment: Published in ICML 2023. Project page: https://jykoh.com/fromage
Published: 2023

43. SantaCoder: don't reach for the stars!

Author: Allal, Loubna Ben, Li, Raymond, Kocetkov, Denis, Mou, Chenghao, Akiki, Christopher, Ferrandis, Carlos Munoz, Muennighoff, Niklas, Mishra, Mayank, Gu, Alex, Dey, Manan, Umapathi, Logesh Kumar, Anderson, Carolyn Jane, Zi, Yangtian, Poirier, Joel Lamy, Schoelkopf, Hailey, Troshin, Sergey, Abulkhanov, Dmitry, Romero, Manuel, Lappert, Michael, De Toni, Francesco, del Río, Bernardo García, Liu, Qian, Bose, Shamik, Bhattacharyya, Urvashi, Zhuo, Terry Yue, Yu, Ian, Villegas, Paulo, Zocca, Marco, Mangrulkar, Sourab, Lansky, David, Nguyen, Huu, Contractor, Danish, Villa, Luis, Li, Jia, Bahdanau, Dzmitry, Jernite, Yacine, Hughes, Sean, Fried, Daniel, Guha, Arjun, de Vries, Harm, and von Werra, Leandro
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.
Published: 2023

44. The Interagency Foreign Policy Process

Author: Fried, Daniel, primary and Tefft, John, additional
Published: 2024

45. Execution-Based Evaluation for Open-Domain Code Generation

Author: Wang, Zhiruo, Zhou, Shuyan, Fried, Daniel, and Neubig, Graham
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports four natural languages as intents, in English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences among top-performing code language models (LM). While CODEX achieves better overall results, CODEGEN improves effectively via scaling -- CODEGEN 6.1B performs comparably with CODEX 12B. Both models show substantial gaps between open and closed domains, but CODEGEN gaps tend to decrease with model size while CODEX gaps increase. We release ODEX to facilitate research into open-domain problems for the code generation community.
Published: 2022

46. Coder Reviewer Reranking for Code Generation

Author: Zhang, Tianyi, Yu, Tao, Hashimoto, Tatsunori B., Lewis, Mike, Yih, Wen-tau, Fried, Daniel, and Wang, Sida I.
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Programming Languages, Computer Science - Software Engineering
Abstract: Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the likelihood of the instruction given the generated programs. We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk method. Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters.
Published: 2022

47. G^3: Geolocation via Guidebook Grounding

Author: Luo, Grace, Biamby, Giscard, Darrell, Trevor, Fried, Daniel, and Rohrbach, Anna
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We demonstrate how language can improve geolocation: the task of predicting the location where an image was taken. Here we study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation. We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations and an associated textual guidebook for GeoGuessr, a popular interactive geolocation game. Our approach predicts a country for each image by attending over the clues automatically extracted from the guidebook. Supervising attention with country-level pseudo labels achieves the best performance. Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy. Our dataset and code can be found at https://github.com/g-luo/geolocation_via_guidebook_grounding.
Comment: Findings of EMNLP 2022
Published: 2022

48. AutoReply: Detecting Nonsense in Dialogue Introspectively with Discriminative Replies

Author: Shi, Weiyan, Dinan, Emily, Renduchintala, Adi, Fried, Daniel, Jacob, Athul Paul, Yu, Zhou, and Lewis, Mike
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing approaches built separate classifiers to detect nonsense in dialogues. In this paper, we show that without external classifiers, dialogue models can detect errors in their own messages introspectively, by calculating the likelihood of replies that are indicative of poor messages. For example, if an agent believes its partner is likely to respond "I don't understand" to a candidate message, that message may not make sense, so an alternative message should be chosen. We evaluate our approach on a dataset from the game Diplomacy, which contains long dialogues richly grounded in the game state, on which existing models make many errors. We first show that hand-crafted replies can be effective for the task of detecting nonsense in applications as complex as Diplomacy. We then design AutoReply, an algorithm to search for such discriminative replies automatically, given a small number of annotated dialogue examples. We find that AutoReply-generated replies outperform handcrafted replies and perform on par with carefully fine-tuned large supervised models. Results also show that a single reply, without much computational overhead, can detect dialogue nonsense reasonably well.
Published: 2022

49. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Author: Lai, Yuhang, Li, Chengxi, Wang, Yiming, Zhang, Tianyi, Zhong, Ruiqi, Zettlemoyer, Luke, Yih, Scott Wen-tau, Fried, Daniel, Wang, Sida, and Yu, Tao
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
Published: 2022

50. Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches

Author: Fried, Daniel, Tomlin, Nicholas, Hu, Jennifer, Patel, Roma, and Nematzadeh, Aida
Subjects: Computer Science - Computation and Language
Abstract: People rely heavily on context to enrich meaning beyond what is literally said, enabling concise but effective communication. To interact successfully and naturally with people, user-facing artificial intelligence systems will require similar skills in pragmatics: relying on various types of context -- from shared linguistic goals and conventions, to the visual and embodied world -- to use language effectively. We survey existing grounded settings and pragmatic modeling approaches and analyze how the task goals, environmental contexts, and communicative affordances in each work enrich linguistic meaning. We present recommendations for future grounded task design to naturally elicit pragmatic phenomena, and suggest directions that focus on a broader range of communicative contexts and affordances.
Comment: Findings of EMNLP 2023
Published: 2022
