46 results on "Artzi, Yoav"
Search Results
2. Continual Learning for Instruction Following from Realtime Feedback
- Author
-
Suhr, Alane and Artzi, Yoav
- Published
- 2023
3. Semantic uncertainty guides the extension of conventions to new referents
- Author
-
Eliav, Ron, Ji, Anya, Artzi, Yoav, and Hawkins, Robert
- Abstract
A long tradition of studies in psycholinguistics has examined the formation and generalization of ad hoc conventions in reference games, showing how newly acquired conventions for a given target transfer to new referential contexts. However, another axis of generalization remains understudied: how do conventions formed for one target transfer to completely distinct targets, when specific lexical choices are unlikely to repeat? This paper presents two dyadic studies (N=240) that address this axis of generalization, focusing on the role of nameability --- the a priori likelihood that two individuals will share the same label. We leverage the recently-released KiloGram dataset, a collection of abstract tangram images that is orders of magnitude larger than previously available, exhibiting high diversity of properties like nameability. Our first study asks how nameability shapes convention formation, while the second asks how new conventions generalize to entirely new targets of reference. Our results raise new questions about how ad hoc conventions extend beyond target-specific re-use of specific lexical choices.
- Published
- 2023
4. CB2: Collaborative Natural Language Interaction Research Platform
- Author
-
Sharf, Jacob, Gul, Mustafa Omer, and Artzi, Yoav
- Abstract
CB2 is a multi-agent platform to study collaborative natural language interaction in a grounded task-oriented scenario. It includes a 3D game environment, a backend server designed to serve trained models to human agents, and various tools and processes to enable scalable studies. We deploy CB2 at https://cb2.ai as a system demonstration with a learned instruction following model., Comment: ACL 2023 Demo paper
- Published
- 2023
5. Continually Improving Extractive QA via Human Feedback
- Author
-
Gao, Ge, Chen, Hung-Ting, Artzi, Yoav, and Choi, Eunsol
- Abstract
We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation., Comment: EMNLP 2023
- Published
- 2023
6. Semantic uncertainty guides the extension of conventions to new referents
- Author
-
Eliav, Ron, Ji, Anya, Artzi, Yoav, and Hawkins, Robert D.
- Abstract
A long tradition of studies in psycholinguistics has examined the formation and generalization of ad hoc conventions in reference games, showing how newly acquired conventions for a given target transfer to new referential contexts. However, another axis of generalization remains understudied: how do conventions formed for one target transfer to completely distinct targets, when specific lexical choices are unlikely to repeat? This paper presents two dyadic studies (N = 240) that address this axis of generalization, focusing on the role of nameability -- the a priori likelihood that two individuals will share the same label. We leverage the recently-released KiloGram dataset, a collection of abstract tangram images that is orders of magnitude larger than previously available, exhibiting high diversity of properties like nameability. Our first study asks how nameability shapes convention formation, while the second asks how new conventions generalize to entirely new targets of reference. Our results raise new questions about how ad hoc conventions extend beyond target-specific re-use of specific lexical choices., Comment: Proceedings of the 45th Annual Conference of the Cognitive Science Society
- Published
- 2023
7. SteP: Stacked LLM Policies for Web Actions
- Author
-
Sodhi, Paloma, Branavan, S. R. K., Artzi, Yoav, and McDonald, Ryan
- Abstract
Performing tasks on the web presents fundamental challenges to large language models (LLMs), including combinatorially large open-world tasks and variations across web interfaces. Simply specifying a large prompt to handle all possible behaviors and states is extremely complex, and results in behavior leaks between unrelated behaviors. Decomposition to distinct policies can address this challenge, but requires carefully handing off control between policies. We propose Stacked LLM Policies for Web Actions (SteP), an approach to dynamically compose policies to solve a diverse set of web tasks. SteP defines a Markov Decision Process where the state is a stack of policies representing the control state, i.e., the chain of policy calls. Unlike traditional methods that are restricted to static hierarchies, SteP enables dynamic control that adapts to the complexity of the task. We evaluate SteP against multiple baselines and web environments including WebArena, MiniWoB++, and a CRM simulator. On WebArena, SteP improves (14.9% to 35.8%) over SOTA that uses GPT-4 policies, while on MiniWoB++, SteP is competitive with prior works while using significantly less data. Our code and data are available at https://asappresearch.github.io/webagents-step., Comment: 30 pages, 15 figures
- Published
- 2023
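The central idea in the SteP abstract above, a control state that is a stack of policies calling one another, can be illustrated with a small dispatcher: the policy on top of the stack either emits a primitive action, pushes a sub-policy, or pops itself when done. This is a minimal sketch under assumed names; the policy classes, action strings, and the Step dataclass below are hypothetical and stand in for the paper's LLM-backed policies and web environments.

```python
# Toy sketch of a stacked-policy controller: the control state is a stack of
# policies; the top policy either emits a primitive action, pushes a sub-policy,
# or pops itself when finished. All names here are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    action: Optional[str] = None     # primitive web action to execute, if any
    call: Optional["Policy"] = None  # sub-policy to push onto the stack, if any
    done: bool = False               # pop this policy from the stack

class Policy:
    name = "base"
    def step(self, observation: str) -> Step:
        raise NotImplementedError

class FindIssuePolicy(Policy):
    name = "find_issue"
    def __init__(self):
        self.acted = False
    def step(self, observation: str) -> Step:
        if not self.acted:
            self.acted = True
            return Step(action="type search_box 'login bug'")
        return Step(done=True)

class SolveTaskPolicy(Policy):
    name = "solve_task"
    def __init__(self):
        self.delegated = False
    def step(self, observation: str) -> Step:
        if not self.delegated:
            self.delegated = True
            return Step(call=FindIssuePolicy())  # dynamic composition: delegate to a sub-policy
        return Step(action="click submit", done=True)

def run(root: Policy, max_steps: int = 10) -> List[str]:
    stack, trace = [root], []
    for _ in range(max_steps):
        if not stack:
            break
        step = stack[-1].step(observation="<page state>")
        if step.action:
            trace.append(f"{stack[-1].name}: {step.action}")
        if step.call:
            stack.append(step.call)
        elif step.done:
            stack.pop()
    return trace

print(run(SolveTaskPolicy()))
```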
8. A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
- Author
-
Kojima, Noriyuki, Averbuch-Elor, Hadar, and Artzi, Yoav
- Abstract
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding., Comment: Published in TMLR, January 2024
- Published
- 2023
9. IncDSI: Incrementally Updatable Document Retrieval
- Author
-
Kishore, Varsha, Wan, Chao, Lovelace, Justin, Artzi, Yoav, and Weinberger, Kilian Q.
- Abstract
Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performances for document retrieval across many benchmarks. However, these models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real-time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
- Published
- 2023
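The constrained-optimization framing in the IncDSI abstract above can be illustrated with a toy sketch: keep the query encoder and all existing document vectors frozen, and optimize only a new document vector so that queries about the new document rank it first while old documents keep their rankings. This is an illustrative reconstruction, not the authors' implementation; the shapes, margin formulation, and variable names below are assumptions.

```python
# Illustrative sketch (not the official IncDSI code): add one new document vector
# to a frozen retrieval model by optimizing only that vector with hinge losses.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_docs = 64, 100
doc_vectors = F.normalize(torch.randn(num_docs, dim), dim=-1)   # frozen existing documents

new_doc_queries = F.normalize(torch.randn(5, dim), dim=-1)      # encoded queries about the new doc (assumed given)
old_queries = F.normalize(torch.randn(20, dim), dim=-1)         # queries whose answers are existing docs
old_labels = torch.randint(0, num_docs, (20,))

new_vec = F.normalize(torch.randn(dim), dim=-1).clone().requires_grad_(True)
opt = torch.optim.Adam([new_vec], lr=0.05)
margin = 0.1

for step in range(200):
    opt.zero_grad()
    # New-doc queries should score higher on the new vector than on any old document.
    pos = new_doc_queries @ new_vec
    best_old = (new_doc_queries @ doc_vectors.T).max(dim=-1).values
    loss_new = F.relu(margin - (pos - best_old)).mean()
    # Old queries should still score higher on their gold documents than on the new vector.
    gold = (old_queries * doc_vectors[old_labels]).sum(-1)
    intruder = old_queries @ new_vec
    loss_old = F.relu(margin - (gold - intruder)).mean()
    (loss_new + loss_old).backward()
    opt.step()
```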
10. Continual Learning for Instruction Following from Realtime Feedback
- Author
-
Suhr, Alane and Artzi, Yoav
- Abstract
We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data., Comment: NeurIPS 2023 Spotlight paper
- Published
- 2022
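The contextual-bandit conversion described in the abstract above, where binary user feedback becomes an immediate reward for the actions the agent just took, can be sketched as a simple policy-gradient update. This is a generic illustration under assumed dimensions and a fabricated feedback signal, not the paper's agent, environment, or training recipe.

```python
# Minimal sketch of contextual-bandit learning from binary feedback:
# sampled actions are reinforced with reward +1 (approve) or -1 (reject).
import torch
import torch.nn as nn

state_dim, num_actions = 32, 8
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def act(state):
    """Sample an action and keep its log-probability for the bandit update."""
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    return action, dist.log_prob(action)

def bandit_update(log_probs, rewards):
    """Immediate-reward policy gradient: maximize E[r * log pi(a|s)]."""
    loss = -(torch.stack(rewards) * torch.stack(log_probs)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One simulated interaction round with fabricated feedback standing in for users.
states = [torch.randn(state_dim) for _ in range(16)]
log_probs, rewards = [], []
for s in states:
    a, lp = act(s)
    log_probs.append(lp)
    rewards.append(torch.tensor(1.0 if torch.rand(()) > 0.5 else -1.0))
bandit_update(log_probs, rewards)
```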
11. Abstract Visual Reasoning with Tangram Shapes
- Author
-
Ji, Anya, Kojima, Noriyuki, Rush, Noah, Suhr, Alane, Vong, Wai Keen, Hawkins, Robert D., and Artzi, Yoav
- Abstract
We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs. KiloGram is available at https://lil.nlp.cornell.edu/kilogram., Comment: EMNLP 2022 long paper
- Published
- 2022
12. lilGym: Natural Language Visual Reasoning with Reinforcement Learning
- Author
-
Wu, Anne, Brantley, Kianté, Kojima, Noriyuki, and Artzi, Yoav
- Abstract
We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/., Comment: ACL 2023 Long Paper
- Published
- 2022
13. Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
- Author
-
Wu, Felix, Kim, Kwangyoun, Watanabe, Shinji, Han, Kyu, McDonald, Ryan, Weinberger, Kilian Q., and Artzi, Yoav
- Abstract
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods., Comment: Code available at https://github.com/asappresearch/wav2seq
- Published
- 2022
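The pseudo-language induction described in the Wav2Seq abstract above can be sketched in miniature: cluster frame-level speech features into discrete units and collapse repeats into a compact token sequence. This is only a rough illustration under assumed inputs; the actual system uses learned self-supervised speech features and builds subword units (e.g., BPE) on top of the cluster IDs, which is omitted here.

```python
# Rough sketch of inducing a "pseudo language" from audio features: cluster
# frame-level features into discrete units, then collapse repeated units into
# a pseudo token sequence. Random vectors stand in for real speech features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 40))        # stand-in for acoustic/SSL frame features
units = KMeans(n_clusters=25, n_init=10, random_state=0).fit_predict(frames)

def collapse_repeats(ids):
    out = [ids[0]]
    for i in ids[1:]:
        if i != out[-1]:
            out.append(i)
    return out

pseudo_tokens = collapse_repeats(units.tolist())
print(len(units), "frames ->", len(pseudo_tokens), "pseudo tokens")
```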
14. Simulating Bandit Learning from User Feedback for Extractive Question Answering
- Author
-
Gao, Ge, Choi, Eunsol, and Artzi, Yoav
- Abstract
We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and analyze the characteristics of several learning scenarios with focus on reducing data annotation. We show that systems initially trained on a small number of examples can dramatically improve given feedback from users on model-predicted answers, and that one can use existing datasets to deploy systems in new domains without any annotation, but instead improving the system on-the-fly via user feedback., Comment: ACL 2022
- Published
- 2022
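One plausible reading of "simulating feedback using supervised data" in the abstract above is to score a model-predicted answer span against the gold answer and treat that score (or a binarized version of it) as the user's reaction. The helper names and the threshold below are illustrative assumptions, not the paper's exact simulation protocol.

```python
# Sketch of simulating user feedback from supervised QA data: the "reward" for a
# model-predicted answer span is derived from its token-level F1 against the gold answer.
from collections import Counter

def token_f1(predicted: str, gold: str) -> float:
    pred_tokens, gold_tokens = predicted.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def simulated_feedback(predicted: str, gold: str, threshold: float = 0.5) -> int:
    """Binary reward standing in for a user's accept/reject of the shown answer."""
    return 1 if token_f1(predicted, gold) >= threshold else -1

print(simulated_feedback("the Eiffel Tower", "Eiffel Tower"))   # 1
print(simulated_feedback("in 1887", "Eiffel Tower"))            # -1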
15. SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech
- Author
-
Shon, Suwon, Pasad, Ankita, Wu, Felix, Brusco, Pablo, Artzi, Yoav, Livescu, Karen, and Han, Kyu J.
- Abstract
Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models., Comment: Updated preprint for SLUE Benchmark v0.2; Toolkit link https://github.com/asappresearch/slue-toolkit
- Published
- 2021
16. When in Doubt: Improving Classification Performance with Alternating Normalization
- Author
-
Jia, Menglin, Reiter, Austin, Lim, Ser-Nam, Artzi, Yoav, and Cardie, Claire
- Abstract
We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks., Comment: Findings of EMNLP 2021
- Published
- 2021
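A rough Sinkhorn-style sketch of the alternating normalization idea described in the CAN abstract above: stack the distributions predicted for high-confidence validation examples together with the test example's distribution, alternately normalize columns and rows, and keep only the re-adjusted test row. This is an interpretation for illustration; the exact scaling and class-prior weighting in the paper may differ.

```python
# Rough sketch (assumptions, not the paper's exact algorithm): re-adjust one
# low-confidence prediction by alternating column/row normalization against
# a block of high-confidence validation predictions.
import numpy as np

def alternating_normalization(high_conf: np.ndarray, test_dist: np.ndarray, iters: int = 3) -> np.ndarray:
    """high_conf: (n, k) rows summing to 1; test_dist: (k,) summing to 1."""
    L = np.vstack([high_conf, test_dist[None, :]])
    for _ in range(iters):
        L = L / L.sum(axis=0, keepdims=True)   # column-normalize: balance per-class mass
        L = L / L.sum(axis=1, keepdims=True)   # row-normalize: back to per-example distributions
        L[:-1] = high_conf                      # high-confidence rows act as anchors; only the last row adapts
    return L[-1]

rng = np.random.default_rng(0)
high_conf = rng.dirichlet(alpha=[5, 1, 1], size=50)   # confident predictions, mostly class 0
test_dist = np.array([0.4, 0.35, 0.25])               # ambiguous prediction to re-adjust
print(alternating_normalization(high_conf, test_dist))
```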
17. Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
- Author
-
Wu, Felix, Kim, Kwangyoun, Pan, Jing, Han, Kyu, Weinberger, Kilian Q., and Artzi, Yoav
- Abstract
This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes., Comment: Code available at https://github.com/asappresearch/sew
- Published
- 2021
18. Analysis of Language Change in Collaborative Instruction Following
- Author
-
Effenberger, Anna, Yan, Eva, Singh, Rhia, Suhr, Alane, and Artzi, Yoav
- Abstract
We analyze language change over time in a collaborative, goal-oriented instructional task, where utility-maximizing participants form conventions and increase their expertise. Prior work studied such scenarios mostly in the context of reference games, and consistently found that language complexity is reduced along multiple dimensions, such as utterance length, as conventions are formed. In contrast, we find that, given the ability to increase instruction utility, instructors increase language complexity along these previously studied dimensions to better collaborate with increasingly skilled instruction followers., Comment: Findings of EMNLP 2021 Short Paper
- Published
- 2021
19. Who's Waldo? Linking People Across Text and Images
- Author
-
Cui, Claire Yuqing, Khandelwal, Apoorv, Artzi, Yoav, Snavely, Noah, and Averbuch-Elor, Hadar
- Abstract
We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language., Comment: Published in ICCV 2021 (Oral). Project webpage: https://whoswaldo.github.io
- Published
- 2021
20. Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior
- Author
-
Kojima, Noriyuki, Suhr, Alane, and Artzi, Yoav
- Abstract
We study continual learning for natural language instruction generation, by observing human users' instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication of the system's success in communicating its intent. We show how to use this signal to improve the system's ability to generate instructions via contextual bandit learning. In interaction with real users, our system demonstrates dramatic improvements in its ability to generate language over time., Comment: To appear in TACL 2021. The arXiv version is a pre-MIT Press publication version
- Published
- 2021
21. A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
- Author
-
Blukis, Valts, Paxton, Chris, Fox, Dieter, Garg, Animesh, and Artzi, Yoav
- Abstract
Natural language provides an accessible and expressive interface to specify long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that key to bridging this gap between language and robot actions over long execution horizons are persistent representations. We propose a persistent spatial semantic representation method, and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions., Comment: Presented at CoRL 2021
- Published
- 2021
22. Revisiting Few-sample BERT Fine-tuning
- Author
-
Zhang, Tianyi, Wu, Felix, Katiyar, Arzoo, Weinberger, Kilian Q., and Artzi, Yoav
- Abstract
This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process., Comment: Code available at https://github.com/asappresearch/revisit-bert-finetuning
- Published
- 2020
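The "non-standard optimization method with biased gradient estimation" mentioned in the abstract above is commonly read as the BERT fine-tuning optimizer's omission of Adam's bias-correction terms. The toy comparison below shows why those terms matter: without correction, the very first updates are noticeably larger, which the abstract associates with instability. This is a generic Adam step for illustration, not the paper's code.

```python
# Illustration of the bias-correction terms the abstract alludes to: dropping
# them inflates the earliest updates. Generic Adam step, not the paper's code.
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, correct_bias=True):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m, v
    if correct_bias:                      # debiasing: divide by (1 - beta^t)
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

p, m, v, g = 1.0, 0.0, 0.0, 0.5
for name, flag in [("with bias correction", True), ("without bias correction", False)]:
    p1, _, _ = adam_step(p, g, m, v, t=1, correct_bias=flag)
    print(f"{name}: first-step update magnitude = {abs(p1 - p):.6f}")
```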
23. What is Learned in Visually Grounded Neural Syntax Acquisition
- Author
-
Kojima, Noriyuki, Averbuch-Elor, Hadar, Rush, Alexander M., and Artzi, Yoav
- Abstract
Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance. Contrary to what the model might be capable of learning, we find significantly less expressive versions produce similar predictions and perform just as well, or even better. We also find that a simple lexical signal of noun concreteness plays the main role in the model's predictions as opposed to more complex syntactic reasoning., Comment: In ACL 2020
- Published
- 2020
24. Evaluating Models' Local Decision Boundaries via Contrast Sets
- Author
-
Gardner, Matt, Artzi, Yoav, Basmova, Victoria, Berant, Jonathan, Bogin, Ben, Chen, Sihao, Dasigi, Pradeep, Dua, Dheeru, Elazar, Yanai, Gottumukkala, Ananth, Gupta, Nitish, Hajishirzi, Hanna, Ilharco, Gabriel, Khashabi, Daniel, Lin, Kevin, Liu, Jiangming, Liu, Nelson F., Mulcaire, Phoebe, Ning, Qiang, Singh, Sameer, Smith, Noah A., Subramanian, Sanjay, Tsarfaty, Reut, Wallace, Eric, Zhang, Ally, and Zhou, Ben
- Abstract
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
- Published
- 2020
25. Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View
- Author
-
Mehta, Harsh, Artzi, Yoav, Baldridge, Jason, Ie, Eugene, and Mirowski, Piotr
- Abstract
The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison.
- Published
- 2020
26. Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following
- Author
-
Blukis, Valts, Knepper, Ross A., and Artzi, Yoav
- Abstract
We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test-time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training., Comment: 4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA
- Published
- 2020
27. Interactive Classification by Asking Informative Questions
- Author
-
Yu, Lili, Chen, Howard, Wang, Sida, Lei, Tao, and Artzi, Yoav
- Abstract
We study the potential for interaction in natural language classification. We add a limited form of interaction for intent classification, where users provide an initial query using natural language, and the system asks for additional information using binary or multi-choice questions. At each turn, our system decides between asking the most informative question or making the final classification prediction. The simplicity of the model allows for bootstrapping of the system without interaction data, instead relying on simple crowdsourcing tasks. We evaluate our approach on two domains, showing the benefit of interaction and the advantage of learning to balance between asking additional questions and making the final prediction., Comment: Accepted at ACL 2020
- Published
- 2019
28. Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
- Author
-
Blukis, Valts, Terme, Yannick, Niklasson, Eyvind, Knepper, Ross A., and Artzi, Yoav
- Abstract
We propose a joint simulation and real-world learning framework for mapping navigation instructions and raw first-person observations to continuous control. Our model estimates the need for environment exploration, predicts the likelihood of visiting environment positions during execution, and controls the agent to both explore and visit high-likelihood positions. We introduce Supervised Reinforcement Asynchronous Learning (SuReAL). Learning uses both simulation and real environments without requiring autonomous flight in the physical environment during training, and combines supervised learning for predicting positions to visit and reinforcement learning for continuous control. We evaluate our approach on a natural language instruction-following task with a physical quadcopter, and demonstrate effective execution and exploration behavior., Comment: Conference on Robot Learning (CoRL) 2019
- Published
- 2019
29. Executing Instructions in Situated Collaborative Interactions
- Author
-
Suhr, Alane, Yan, Claudia, Schluger, Charlotte, Yu, Stanley, Khader, Hadi, Mouallem, Marwa, Zhang, Iris, and Artzi, Yoav
- Abstract
We study a collaborative scenario where a user not only instructs a system to complete tasks, but also acts alongside it. This allows the user to adapt to the system abilities by changing their language or deciding to simply accomplish some tasks themselves, and requires the system to effectively recover from errors as the user strategically assigns it new goals. We build a game environment to study this scenario, and learn to map user instructions to system actions. We introduce a learning approach focused on recovery from cascading errors between instructions, and modeling methods to explicitly reason about instructions with multiple goals. We evaluate with a new evaluation protocol using recorded interactions and online games with human users, and observe how users adapt to the system abilities., Comment: EMNLP 2019 long paper
- Published
- 2019
30. NLVR2 Visual Bias Analysis
- Author
-
Suhr, Alane and Artzi, Yoav
- Abstract
NLVR2 (Suhr et al., 2019) was designed to be robust to language bias through a data collection process that resulted in each natural language sentence appearing with both true and false labels. The process did not provide a similar measure of control for visual bias. This technical report analyzes the potential for visual bias in NLVR2. We show that some amount of visual bias likely exists. Finally, we identify a subset of the test data that allows testing model performance in a way that is robust to such potential biases. We show that the performance of existing models (Li et al., 2019; Tan and Bansal, 2019) is relatively robust to this potential bias. We propose to add the evaluation on this subset of the data to the NLVR2 evaluation protocol, and update the official release to include it. A notebook including an implementation of the code used to replicate this analysis is available at http://nlvr.ai/NLVR2BiasAnalysis.html., Comment: Corresponding notebook available at http://lil.nlp.cornell.edu/nlvr/NLVR2BiasAnalysis.html
- Published
- 2019
31. BERTScore: Evaluating Text Generation with BERT
- Author
-
Zhang, Tianyi, Kishore, Varsha, Wu, Felix, Weinberger, Kilian Q., and Artzi, Yoav
- Abstract
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics., Comment: Code available at https://github.com/Tiiiger/bert_score; To appear in ICLR2020
- Published
- 2019
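The greedy token-matching described in the BERTScore abstract above is straightforward to sketch: compute pairwise cosine similarities between candidate and reference token embeddings, let each candidate token take its best match for precision and each reference token its best match for recall, and combine into F1. The sketch below uses random vectors in place of actual BERT contextual embeddings and omits IDF weighting and baseline rescaling.

```python
# Minimal sketch of BERTScore's greedy token matching, using random vectors in
# place of contextual embeddings; IDF weighting and rescaling are omitted.
import numpy as np

def bertscore(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """cand_emb: (m, d) candidate token embeddings; ref_emb: (n, d) reference token embeddings."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # pairwise cosine similarities
    precision = sim.max(axis=1).mean()      # each candidate token greedily matches its best reference token
    recall = sim.max(axis=0).mean()         # each reference token greedily matches its best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
cand_emb, ref_emb = rng.normal(size=(7, 768)), rng.normal(size=(9, 768))
print(bertscore(cand_emb, ref_emb))
```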
32. Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
- Author
-
Chen, Howard, Suhr, Alane, Misra, Dipendra, Snavely, Noah, and Artzi, Yoav
- Abstract
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment, and then identify a location described in natural language to find a hidden object at the goal position. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. Empirical analysis shows the data presents an open challenge to existing methods, and qualitative linguistic analysis shows that the data displays richer use of spatial reasoning compared to related resources., Comment: arXiv admin note: text overlap with arXiv:1809.00786
- Published
- 2018
33. Early Fusion for Goal Directed Robotic Vision
- Author
-
Walsman, Aaron, Bisk, Yonatan, Gabriel, Saadia, Misra, Dipendra, Artzi, Yoav, Choi, Yejin, and Fox, Dieter
- Abstract
Building perceptual systems for robotics which perform well under tight computational budgets requires novel architectures which rethink the traditional computer vision pipeline. Modern vision architectures require the agent to build a summary representation of the entire scene, even if most of the input is irrelevant to the agent's current goal. In this work, we flip this paradigm, by introducing EarlyFusion vision models that condition on a goal to build custom representations for downstream tasks. We show that these goal specific representations can be learned more quickly, are substantially more parameter efficient, and more robust than existing attention mechanisms in our domain. We demonstrate the effectiveness of these methods on a simulated robotic item retrieval problem that is trained in a fully end-to-end manner via imitation learning.
- Published
- 2018
34. Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction
- Author
-
Blukis, Valts, Misra, Dipendra, Knepper, Ross A., and Artzi, Yoav
- Abstract
We propose an approach for mapping natural language instructions and raw observations to continuous control of a quadcopter drone. Our model predicts interpretable position-visitation distributions indicating where the agent should go during execution and where it should stop, and uses the predicted distributions to select the actions to execute. This two-step model decomposition allows for simple and efficient training using a combination of supervised learning and imitation learning. We evaluate our approach with a realistic drone simulator, and demonstrate absolute task-completion accuracy improvements of 16.85% over two state-of-the-art instruction-following methods., Comment: Appeared in Conference on Robot Learning 2018
- Published
- 2018
35. A Corpus for Reasoning About Natural Language Grounded in Photographs
- Author
-
Suhr, Alane, Zhou, Stephanie, Zhang, Ally, Zhang, Iris, Bai, Huajun, and Artzi, Yoav
- Abstract
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge., Comment: ACL 2019 Long Paper
- Published
- 2018
36. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
- Author
-
Misra, Dipendra, Bennett, Andrew, Blukis, Valts, Niklasson, Eyvind, Shatkhin, Max, and Artzi, Yoav
- Abstract
We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition, and illustrates the challenges posed by our new benchmarks., Comment: Accepted at EMNLP 2018
- Published
- 2018
37. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
- Author
-
Grusky, Max, Naaman, Mor, and Artzi, Yoav
- Abstract
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges., Comment: Proceedings of NAACL-HLT 2018 (Long Paper)
- Published
- 2018
38. Learning to Map Context-Dependent Sentences to Executable Formal Queries
- Author
-
Suhr, Alane, Iyer, Srinivasan, and Artzi, Yoav
- Abstract
We propose a context-dependent model to map utterances within an interaction to executable formal queries. To incorporate interaction history, the model maintains an interaction-level encoder that updates after each turn, and can copy sub-sequences of previously predicted queries during generation. Our approach combines implicit and explicit modeling of references between utterances. We evaluate our model on the ATIS flight planning interactions, and demonstrate the benefits of modeling context and explicit references., Comment: NAACL-HLT 2018 Long Paper
- Published
- 2018
39. Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning
- Author
-
Blukis, Valts, Brukhim, Nataly, Bennett, Andrew, Knepper, Ross A., and Artzi, Yoav
- Abstract
We introduce a method for following high-level navigation instructions by mapping directly from images, instructions and pose estimates to continuous low-level velocity commands for real-time control. The Grounded Semantic Mapping Network (GSMN) is a fully-differentiable neural network architecture that builds an explicit semantic map in the world reference frame by incorporating a pinhole camera projection model within the network. The information stored in the map is learned from experience, while the local-to-world transformation is computed explicitly. We train the model using DAggerFM, a modified variant of DAgger that trades tabular convergence guarantees for improved training speed and memory use. We test GSMN in virtual environments on a realistic quadcopter simulator and show that incorporating explicit mapping and grounding modules allows GSMN to outperform strong neural baselines and almost reach an expert policy performance. Finally, we analyze the learned map representations and show that using an explicit map leads to an interpretable instruction-following model., Comment: To appear in Robotics: Science and Systems (RSS), 2018
- Published
- 2018
40. Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation
- Author
-
Suhr, Alane and Artzi, Yoav
- Abstract
We propose a learning approach for mapping context-dependent sequential instructions to actions. We address the problem of discourse and state dependencies with an attention-based model that considers both the history of the interaction and the state of the world. To train from start and goal states without access to demonstrations, we propose SESTRA, a learning algorithm that takes advantage of single-step reward observations and immediate expected reward maximization. We evaluate on the SCONE domains, and show absolute accuracy improvements of 9.8%-25.3% across the domains over approaches that use high-level logical representations., Comment: ACL 2018 Long Paper
- Published
- 2018
41. CHALET: Cornell House Agent Learning Environment
- Author
-
Yan, Claudia, Misra, Dipendra, Bennett, Andrew, Walsman, Aaron, Bisk, Yonatan, and Artzi, Yoav
- Abstract
We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configurations, and allows users to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and actions available are designed to create a challenging domain to train and evaluate autonomous agents, including for tasks that combine language, vision, and planning in a dynamic environment.
- Published
- 2018
42. Simple Recurrent Units for Highly Parallelizable Recurrence
- Author
-
Lei, Tao, Zhang, Yu, Wang, Sida I., Dai, Hui, and Artzi, Yoav
- Abstract
Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5--9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture., Comment: EMNLP
- Published
- 2017
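The parallelization argument in the SRU abstract above comes from the structure of the recurrence: all matrix multiplications depend only on the inputs, so they can be computed for every timestep at once, leaving only cheap elementwise operations in the sequential loop. The sketch below is a simplified single-layer reading of that design; initialization details and the extra scaling terms of the published version are omitted.

```python
# Simplified sketch of an SRU-style light recurrence: the three matrix products
# depend only on the inputs (parallel across time); only elementwise ops recur.
import torch

def sru_layer(x, W, Wf, bf, Wr, br):
    """x: (T, d). Returns hidden states h: (T, d) and the final cell state c."""
    U = x @ W                          # candidate values, computed for all timesteps at once
    F = torch.sigmoid(x @ Wf + bf)     # forget gates
    R = torch.sigmoid(x @ Wr + br)     # reset/highway gates
    c = torch.zeros(x.shape[1])
    hs = []
    for t in range(x.shape[0]):                    # only this loop is sequential
        c = F[t] * c + (1 - F[t]) * U[t]           # cell update
        hs.append(R[t] * c + (1 - R[t]) * x[t])    # highway connection to the input
    return torch.stack(hs), c

T, d = 12, 16
x = torch.randn(T, d)
W, Wf, Wr = (torch.randn(d, d) * 0.1 for _ in range(3))
h, c = sru_layer(x, W, Wf, torch.zeros(d), Wr, torch.zeros(d))
print(h.shape, c.shape)
```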
43. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
- Author
-
Misra, Dipendra, Langford, John, and Artzi, Yoav
- Abstract
We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants., Comment: In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017
- Published
- 2017
44. Visual Reasoning with Natural Language
- Author
-
Zhou, Stephanie, Suhr, Alane, and Artzi, Yoav
- Abstract
Natural language provides a widely accessible and expressive interface for robotic agents. To understand language in complex environments, agents must reason about the full range of language inputs and their correspondence to the world. Such reasoning over language and vision is an open problem that is receiving increasing attention. While existing data sets focus on visual diversity, they do not display the full range of natural language expressions, such as counting, set reasoning, and comparisons. We propose a simple task for natural language visual reasoning, where images are paired with descriptive statements. The task is to predict if a statement is true for the given scene. This abstract describes our existing synthetic images corpus and our current work on collecting real vision data., Comment: AAAI NCHRC 2017
- Published
- 2017
45. Learning to Automatically Solve Algebra Word Problems
- Author
-
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Kushman, Nate, Barzilay, Regina, Artzi, Yoav, and Zettlemoyer, Luke
- Abstract
We present an approach for automatically learning to solve algebra word problems. Our algorithm reasons across sentence boundaries to construct and solve a system of linear equations, while simultaneously recovering an alignment of the variables and numbers in these equations to the problem text. The learning algorithm uses varied supervision, including either full equations or just the final answers. We evaluate performance on a newly gathered corpus of algebra word problems, demonstrating that the system can correctly answer almost 70% of the questions in the dataset. This is, to our knowledge, the first learning result for this task., Battelle Memorial Institute (PO 300662), National Science Foundation (U.S.) (Grant IIS-0835652)
- Published
- 2015
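To make the target representation in the abstract above concrete: once a word problem's numbers are aligned to variables, answering it amounts to solving a small linear system. The problem text and numbers below are an illustrative example, not an item from the paper's corpus.

```python
# Illustrative example (not from the paper's corpus): after aligning a word
# problem's numbers to variables, the answer is the solution of a linear system.
import numpy as np

# "Adult tickets cost $4 and child tickets $1.50. 2200 tickets sold for $5050 total."
#   a +     c = 2200
# 4*a + 1.5*c = 5050
A = np.array([[1.0, 1.0],
              [4.0, 1.5]])
b = np.array([2200.0, 5050.0])
adult, child = np.linalg.solve(A, b)
print(f"adult tickets: {adult:.0f}, child tickets: {child:.0f}")   # 700 and 1500
```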
46. Cornell SPF: Cornell Semantic Parsing Framework
- Author
-
Artzi, Yoav
- Abstract
The Cornell Semantic Parsing Framework (SPF) is a learning and inference framework for mapping natural language to formal representation of its meaning.
- Published
- 2013