663 results on '"Yu Wenhao"'
Search Results
2. Simultaneously enhance the fire safety and mechanical properties of PLA by incorporating a cyclophosphazene-based flame retardant
- Author
-
Yu Wenhao, Yang Weijun, Xu Pengwu, Dai Chunfa, Liu Qingsheng, and Ma Piming
- Subjects
poly(lactic acid) ,cyclophosphazene-containing flame retardant ,fire safety ,mechanical properties ,Polymers and polymer manufacture ,TP1080-1185 - Abstract
The application of poly(lactic acid) (PLA) has been limited in flame-retardant field, and flame-retardant modification usually deteriorates its mechanical properties. In this study, a reactive flame-retardant hexa(ethylene oxide)-cyclotriphosphazene (HCCP-EP) was synthesized and used to improve the fire retardancy of PLA. As a result, the limiting oxygen index of PLA increased from 19.5% to 27.3% with an addition of 3 wt% HCCP-EP, and the PLA/HCCP-EP blend reached to underwriters laboratories (UL)-94 V-0 rating. The cone calorimeter test results showed that the peak heat release rate and total heat release of PLA decreased by 12.6% and 18.5%, respectively. Interestingly, the tensile strength of PLA increased slightly after the incorporation of HCCP-EP. The improved mechanical properties are ascribed to the fine dispersion of HCCP-EP and the coupling reaction between the epoxy groups of the HCCP-EP and the terminal groups of PLA during the melt processing.
- Published
- 2022
- Full Text
- View/download PDF
3. Novel predictor of the occurrence of DKA in T1DM patients without infection: A combination of neutrophil/lymphocyte ratio and white blood cells
- Author
-
Cheng Yiping, Yu Wenhao, Zhou Yuping, Zhang Tao, Chi Haiyan, and Xu Chao
- Subjects
lymphocytes ,white blood cells ,diabetic ketoacidosis ,Biology (General) ,QH301-705.5 - Abstract
The role of inflammation has been identified in the pathogenesis of diabetic ketoacidosis (DKA). The neutrophil/lymphocyte ratio (NLR) and white blood cells (WBC) can be used to predict a systemic inflammatory response. Changes in NLR and WBC levels have never been explored in type 1 diabetes mellitus (T1DM) patients with DKA and an uninfected state. This retrospective study included a total of 644 participants. NLR and WBC were measured in the control group (n = 316) and in T1DM patients with mild-DKA (n = 92), severe-DKA (n = 52), and non-DKA (n = 184) in an uninfected state. Then, we assessed the independent predictors of DKA occurrence in T1DM patients in an uninfected state. The diagnostic performance of variables was determined by receiver operating characteristic curve analysis. Serum NLR of T1DM patients is significantly higher than that of normal controls, and if DKA occurs, NLR increases further and increases with the severity of DKA. In addition to diastolic blood pressure, blood urea nitrogen, glycated hemoglobin (HbA1c), and WBC, NLR was also independently associated with DKA in T1DM patients with an uninfected state (OR = 1.386, 95% CI: 1.127–1.705, p = 0.002). Furthermore, the diagnosis analysis showed that except for NLR and WBC, the area under the curve (AUC) of indicators with a statistical difference in patients with and without DKA were 0.747 for DKA diagnosis, and after the addition of NLR and WBC, the AUC was 0.806. The increased NLR level represents a low-cost and highly accessible predictor for DKA in T1DM patients with an uninfected state. The addition of inflammation indicators can play a statistically significant role in the prediction model of the DKA occurrence.
- Published
- 2021
- Full Text
- View/download PDF
4. Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing
- Author
-
Feng, Yuming, Hong, Chuye, Niu, Yaru, Liu, Shiqi, Yang, Yuxiang, Yu, Wenhao, Zhang, Tingnan, Tan, Jie, and Zhao, Ding
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Multiagent Systems - Abstract
Recently, quadrupedal locomotion has achieved significant success, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these sub-goals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements over baseline approaches, with 36.0% higher success rates and 24.5% reduction in completion time than the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world.
- Published
- 2024
5. Vision Language Models are In-Context Value Learners
- Author
-
Ma, Yecheng Jason, Hejna, Joey, Wahid, Ayzaan, Fu, Chuyuan, Shah, Dhruv, Liang, Jacky, Xu, Zhuo, Kirmani, Sean, Xu, Peng, Driess, Danny, Xiao, Ted, Tompson, Jonathan, Bastani, Osbert, Jayaraman, Dinesh, Yu, Wenhao, Zhang, Tingnan, Sadigh, Dorsa, and Xia, Fei
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (\GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and advantage-weighted regression -- all without any model training or finetuning., Comment: Project website and demo: https://generative-value-learning.github.io/
- Published
- 2024
6. OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
- Author
-
He, Hongliang, Yao, Wenlin, Ma, Kaixin, Yu, Wenhao, Zhang, Hongming, Fang, Tianqing, Lan, Zhenzhong, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they are building text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously conduct real-world exploration and improve itself. We first train the base model with imitation learning to gain the basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.
- Published
- 2024
7. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
- Author
-
Wu, Di, Wang, Hongwei, Yu, Wenhao, Zhang, Yuwei, Chang, Kai-Wei, and Yu, Dong
- Subjects
Computer Science - Computation and Language - Abstract
Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experiment results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
- Published
- 2024
8. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
- Author
-
Ouyang, Siru, Yu, Wenhao, Ma, Kaixin, Xiao, Zilin, Zhang, Zhihan, Jia, Mengzhao, Han, Jiawei, Zhang, Hongming, and Yu, Dong
- Subjects
Computer Science - Software Engineering ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Large Language Models (LLMs) excel in code generation yet struggle with modern AI software engineering tasks. Unlike traditional function-level or file-level coding tasks, AI software engineering requires not only basic coding proficiency but also advanced skills in managing and interacting with code repositories. However, existing methods often overlook the need for repository-level code understanding, which is crucial for accurately grasping the broader context and developing effective solutions. On this basis, we present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions. RepoGraph offers the desired guidance and serves as a repository-wide navigation for AI software engineers. We evaluate RepoGraph on the SWE-bench by plugging it into four different methods of two lines of approaches, where RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks. Our analyses also demonstrate the extensibility and flexibility of RepoGraph by testing on another repo-level coding benchmark, CrossCodeEval. Our code is available at https://github.com/ozyyshr/RepoGraph., Comment: Work in progress
- Published
- 2024
9. Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
- Author
-
Jia, Mengzhao, Yu, Wenhao, Ma, Kaixin, Fang, Tianqing, Zhang, Zhihan, Ouyang, Siru, Zhang, Hongming, Jiang, Meng, and Yu, Dong
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Computation and Language - Abstract
Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, a MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and competitive performance in general domain evaluations., Comment: Our code is available at https://github.com/Jill0001/Leopard
- Published
- 2024
10. MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models
- Author
-
Yu, Wenhao, Peng, Jie, Ying, Yueliang, Li, Sai, Ji, Jianmin, and Zhang, Yanyong
- Subjects
Computer Science - Robotics - Abstract
The integration of large language models (LLMs) with robotics has significantly advanced robots' abilities in perception, cognition, and task planning. The use of natural language interfaces offers a unified approach for expressing the capability differences of heterogeneous robots, facilitating communication between them, and enabling seamless task allocation and collaboration. Currently, the utilization of LLMs to achieve decentralized multi-heterogeneous robot collaborative tasks remains an under-explored area of research. In this paper, we introduce a novel framework that utilizes LLMs to achieve decentralized collaboration among multiple heterogeneous robots. Our framework supports three robot categories, mobile robots, manipulation robots, and mobile manipulation robots, working together to complete tasks such as exploration, transportation, and organization. We developed a rich set of textual feedback mechanisms and chain-of-thought (CoT) prompts to enhance task planning efficiency and overall system performance. The mobile manipulation robot can adjust its base position flexibly, ensuring optimal conditions for grasping tasks. The manipulation robot can comprehend task requirements, seek assistance when necessary, and handle objects appropriately. Meanwhile, the mobile robot can explore the environment extensively, map object locations, and communicate this information to the mobile manipulation robot, thus improving task execution efficiency. We evaluated the framework using PyBullet, creating scenarios with three different room layouts and three distinct operational tasks. We tested various LLM models and conducted ablation studies to assess the contributions of different modules. The experimental results confirm the effectiveness and necessity of our proposed framework.
- Published
- 2024
11. Agile Continuous Jumping in Discontinuous Terrains
- Author
-
Yang, Yuxiang, Shi, Guanya, Lin, Changyi, Meng, Xiangyun, Scalise, Rosario, Castro, Mateo Guaman, Yu, Wenhao, Zhang, Tingnan, Zhao, Ding, Tan, Jie, and Boots, Byron
- Subjects
Computer Science - Robotics - Abstract
We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities. Experiment videos can be found at https://yxyang.github.io/jumping_cod/, Comment: Website: https://yxyang.github.io/jumping_cod/
- Published
- 2024
12. Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
- Author
-
Zhang, Hongming, Pan, Xiaoman, Wang, Hongwei, Ma, Kaixin, Yu, Wenhao, and Yu, Dong
- Subjects
Computer Science - Artificial Intelligence - Abstract
We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information (e.g., task descriptions) and assist users by answering questions or auto-completing contents, autopilot systems must complete tasks from start to finish independently, which requires the system to acquire the state information from the environments actively. To achieve this, an autopilot system should be capable of understanding user intents, actively gathering necessary information from various real-world sources, and making wise decisions. Cognitive Kernel adopts a model-centric design. In our implementation, the central policy model (a fine-tuned LLM) initiates interactions with the environment using a combination of atomic actions, such as opening files, clicking buttons, saving intermediate results to memory, or calling the LLM itself. This differs from the widely used environment-centric design, where a task-specific environment with predefined actions is fixed, and the policy model is limited to selecting the correct action from a given set of options. Our design facilitates seamless information flow across various sources and provides greater flexibility. We evaluate our system in three use cases: real-time information management, private information management, and long-term memory management. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems in these scenarios. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system and the backbone model to encourage further research on LLM-driven autopilot systems.
- Published
- 2024
13. DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
- Author
-
Jing, Liqiang, Huang, Zhehui, Wang, Xiaoyang, Yao, Wenlin, Yu, Wenhao, Ma, Kaixin, Zhang, Hongming, Du, Xinya, and Yu, Dong
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
- Published
- 2024
14. OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning
- Author
-
Yao, Yihang, Cen, Zhepeng, Ding, Wenhao, Lin, Haohong, Liu, Shiqi, Zhang, Tingnan, Yu, Wenhao, and Zhao, Ding
- Subjects
Computer Science - Machine Learning - Abstract
Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach makes compliance with safety constraints through effective data utilization and regularization techniques to benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets showcase OASIS's superiority in benefiting offline safe RL agents to achieve high-reward behavior while satisfying the safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.
- Published
- 2024
15. DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
- Author
-
Zou, Anni, Yu, Wenhao, Zhang, Hongming, Ma, Kaixin, Cai, Deng, Zhang, Zhuosheng, Zhao, Hai, and Yu, Dong
- Subjects
Computer Science - Computation and Language - Abstract
Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark exists to evaluate their performance in such scenarios, where a raw file and questions are provided as input, and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area., Comment: Work in progress
- Published
- 2024
16. Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs
- Author
-
Chiang, Hao-Tien Lewis, Xu, Zhuo, Fu, Zipeng, Jacob, Mithun George, Zhang, Tingnan, Lee, Tsang-Wei Edward, Yu, Wenhao, Schenck, Connor, Rendleman, David, Shah, Dhruv, Xia, Fei, Hsu, Jasmine, Hoech, Jonathan, Florence, Pete, Kirmani, Sean, Singh, Sumeet, Sindhwani, Vikas, Parada, Carolina, Finn, Chelsea, Xu, Peng, Levine, Sergey, and Tan, Jie
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
An elusive goal in navigation research is to build an intelligent agent that can understand multimodal instructions including natural language and image, and perform useful navigation. To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video. Recent advances in Vision Language Models (VLMs) have shown a promising path in achieving this goal as it demonstrates capabilities in perceiving and reasoning about multimodal inputs. However, VLMs are typically trained to predict textual output and it is an open research question about how to best utilize them in navigation. To solve MINT, we present Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs and a robust low-level navigation policy based on topological graphs. The high-level policy consists of a long-context VLM that takes the demonstration tour video and the multimodal user instruction as input to find the goal frame in the tour video. Next, a low-level policy uses the goal frame and an offline constructed topological graph to generate robot actions at every timestep. We evaluated Mobility VLA in a 836m^2 real world environment and show that Mobility VLA has a high end-to-end success rates on previously unsolved multimodal instructions such as "Where should I return this?" while holding a plastic bin. A video demonstrating Mobility VLA can be found here: https://youtu.be/-Tof__Q8_5s
- Published
- 2024
17. LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance
- Author
-
Yu, Wenhao, Peng, Jie, Yang, Huanyu, Zhang, Junrui, Duan, Yifan, Ji, Jianmin, and Zhang, Yanyong
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
The conditional diffusion model has been demonstrated as an efficient tool for learning robot policies, owing to its advancement to accurately model the conditional distribution of policies. The intricate nature of real-world scenarios, characterized by dynamic obstacles and maze-like structures, underscores the complexity of robot local navigation decision-making as a conditional distribution problem. Nevertheless, leveraging the diffusion model for robot local navigation is not trivial and encounters several under-explored challenges: (1) Data Urgency. The complex conditional distribution in local navigation needs training data to include diverse policy in diverse real-world scenarios; (2) Myopic Observation. Due to the diversity of the perception scenarios, diffusion decisions based on the local perspective of robots may prove suboptimal for completing the entire task, as they often lack foresight. In certain scenarios requiring detours, the robot may become trapped. To address these issues, our approach begins with an exploration of a diverse data generation mechanism that encompasses multiple agents exhibiting distinct preferences through target selection informed by integrated global-local insights. Then, based on this diverse training data, a diffusion agent is obtained, capable of excellent collision avoidance in diverse scenarios. Subsequently, we augment our Local Diffusion Planner, also known as LDP by incorporating global observations in a lightweight manner. This enhancement broadens the observational scope of LDP, effectively mitigating the risk of becoming ensnared in local optima and promoting more robust navigational decisions., Comment: 8 pages, 6 figures, accepted by IROS 2024
- Published
- 2024
18. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
- Author
-
Zhuo, Terry Yue, Vu, Minh Chien, Chim, Jenny, Hu, Han, Yu, Wenhao, Widyasari, Ratnadira, Yusuf, Imam Nur Bani, Zhan, Haolan, He, Junda, Paul, Indraneil, Brunner, Simon, Gong, Chen, Hoang, Thong, Zebaze, Armel Randy, Hong, Xiaoheng, Li, Wen-Ding, Kaddour, Jean, Xu, Ming, Zhang, Zhihan, Yadav, Prateek, Jain, Naman, Gu, Alex, Cheng, Zhoujun, Liu, Jiawei, Liu, Qian, Wang, Zijian, Lo, David, Hui, Binyuan, Muennighoff, Niklas, Fried, Daniel, Du, Xiaoning, de Vries, Harm, and Von Werra, Leandro
- Subjects
Computer Science - Software Engineering ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs.To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area., Comment: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)
- Published
- 2024
19. Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
- Author
-
Zhang, Zhihan, Ge, Tao, Liang, Zhenwen, Yu, Wenhao, Yu, Dian, Jia, Mengzhao, Yu, Dong, and Jiang, Meng
- Subjects
Computer Science - Computation and Language - Abstract
Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on broadening the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a deeper understanding of the training problems at hand, enhancing performance not only in standard settings but also in more complex scenarios that require reflective thinking. Specifically, we propose reflective augmentation, a method that embeds problem reflection into each training instance. It trains the model to consider alternative perspectives and engage with abstractions and analogies, thereby fostering a thorough comprehension through reflective reasoning. Extensive experiments validate the achievement of our aim, underscoring the unique advantages of our method and its complementary nature relative to existing augmentation techniques., Comment: Accepted to the main conference of EMNLP 2024; v3 fixes several typos, incorrect section numbers, and missing references to Appendix sections in v2
- Published
- 2024
20. Learning-based legged locomotion; state of the art and future perspectives
- Author
-
Ha, Sehoon, Lee, Joonho, van de Panne, Michiel, Xie, Zhaoming, Yu, Wenhao, and Khadiv, Majid
- Subjects
Computer Science - Robotics - Abstract
Legged locomotion holds the premise of universal mobility, a critical capability for many real-world robotic applications. Both model-based and learning-based approaches have advanced the field of legged locomotion in the past three decades. In recent years, however, a number of factors have dramatically accelerated progress in learning-based methods, including the rise of deep learning, rapid progress in simulating robotic systems, and the availability of high-performance and affordable hardware. This article aims to give a brief history of the field, to summarize recent efforts in learning locomotion skills for quadrupeds, and to provide researchers new to the area with an understanding of the key issues involved. With the recent proliferation of humanoid robots, we further outline the rapid rise of analogous methods for bipedal locomotion. We conclude with a discussion of open problems as well as related societal impact.
- Published
- 2024
21. Research on the Spatial Data Intelligent Foundation Model
- Author
-
Wang, Shaohua, Xie, Xing, Li, Yong, Guo, Danhuai, Cai, Zhi, Liu, Yu, Yue, Yang, Pan, Xiao, Lu, Feng, Wu, Huayi, Gui, Zhipeng, Ding, Zhiming, Zheng, Bolong, Zhang, Fuzheng, Wang, Jingyuan, Chen, Zhengchao, Lu, Hao, Li, Jiayi, Yue, Peng, Yu, Wenhao, Yao, Yao, Sun, Leilei, Zhang, Yong, Chen, Longbiao, Du, Xiaoping, Li, Xiang, Zhang, Xueying, Qin, Kun, Gong, Zhaoya, Dong, Weihua, and Meng, Xiaofeng
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models., Comment: V1 and V2 are in Chinese language, other versions are in English
- Published
- 2024
22. MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions
- Author
-
Liang, Zhenwen, Yu, Dian, Yu, Wenhao, Yao, Wenlin, Zhang, Zhihan, Zhang, Xiangliang, and Yu, Dong
- Subjects
Computer Science - Artificial Intelligence - Abstract
Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single turn question answering formats. However, real world scenarios often involve mathematical question answering that requires multi turn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multiturn interactions and open ended generation. We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address the above limitations of existing LLMs when faced with multiturn and open ended tasks, we develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations. Experimental results emphasize the need for training LLMs with diverse, conversational instruction tuning datasets like MathChatsync. We believe this work outlines one promising direction for improving the multiturn mathematical reasoning abilities of LLMs, thus pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real world applications.
- Published
- 2024
23. All-Optical Manipulation of Band Gap Dynamics via Electron-Phonon Coupling
- Author
-
Zhang, Jicai, Tran, Tien-Dat, Wang, Ziwen, Yu, Wenhao, Zhang, Chong, Lo, Marcus, Xu, Wenqi, and Luu, Tran Trung
- Subjects
Condensed Matter - Materials Science - Abstract
The electron-phonon coupling (EPC) is a ubiquitous interaction in condensed systems and plays a vital role in shaping the electronic properties of materials. Yet, achieving coherent manipulation of electron-phonon coupling has posed a considerable challenge. Here, employing time-resolved high-harmonic generation (tr-HHG) spectroscopy, we demonstrate the coherent manipulation of bandgap dynamics in a BaF2 crystal by precisely controlling the EPC using ultrashort light pulses. The tr-HHG spectrum perturbed by a triply degenerate phonon mode T2g, exhibits simultaneously a remarkable two-dimensional (2D) sensitivity, namely intensity domain in addition to the previously reported energy domain. The dynamic compression and enhancement of the harmonics in the intensity domain showed a {\pi}/2 phase shift compared to the manifestation of shifts of the harmonics in the energy domain, an astounding example of a physical phenomenon being observed simultaneously in two different perspectives. To complement our experimental observations, we employed a quantum model that incorporates the EPC, successfully reproducing the results. In addition, we demonstrated complete control over the EPC strength and initial phase of the coherent phonon oscillations by varying the incident electric field polarization over crystal orientation. Our findings lay a foundation for future investigations aiming to harness and exploit the remarkable potential of EPC in solid-state systems.
- Published
- 2024
24. Gameplay Filters: Robust Zero-Shot Safety through Adversarial Imagination
- Author
-
Nguyen, Duy P., Hsu, Kai-Chieh, Yu, Wenhao, Tan, Jie, and Fisac, Jaime F.
- Subjects
Computer Science - Robotics ,Computer Science - Machine Learning - Abstract
Despite the impressive recent advances in learning-based robot control, ensuring robustness to out-of-distribution conditions remains an open challenge. Safety filters can, in principle, keep arbitrary control policies from incurring catastrophic failures by overriding unsafe actions, but existing solutions for complex (e.g., legged) robot dynamics do not span the full motion envelope and instead rely on local, reduced-order models. These filters tend to overly restrict agility and can still fail when perturbed away from nominal conditions. This paper presents the gameplay filter, a new class of predictive safety filter that continually plays out hypothetical matches between its simulation-trained safety strategy and a virtual adversary co-trained to invoke worst-case events and sim-to-real error, and precludes actions that would cause it to fail down the line. We demonstrate the scalability and robustness of the approach with a first-of-its-kind full-order safety filter for (36-D) quadrupedal dynamics. Physical experiments on two different quadruped platforms demonstrate the superior zero-shot effectiveness of the gameplay filter under large perturbations such as tugging and unmodeled terrain.
- Published
- 2024
25. Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training
- Author
-
Jia, Mengzhao, Zhang, Zhihan, Yu, Wenhao, Jiao, Fangkai, and Jiang, Meng
- Subjects
Computer Science - Computation and Language - Abstract
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline VCAR, which emphasizes the Visual Comprehension training in Addition to mathematical Reasoning learning. It first improves the visual comprehension ability of MLLMs through the visual description generation task, followed by another training step on generating rationales with the assistance of descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods solely relying on rationale supervision, especially on problems with high visual demands.
- Published
- 2024
26. A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene
- Author
-
Zhang, Wenbo, Zhang, Yifan, Lin, Jianfeng, Huang, Binqiang, Zhang, Jinlu, and Yu, Wenhao
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks. However, most of them are only applicable to the English context. Subsequent research has focused on this problem and proposed improved models, such as CN-CLIP and AltCLIP, to facilitate their applicability to Chinese and even other languages. Nevertheless, these models suffer from high latency and a large memory footprint in inference, which limits their further deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment. During the first stage, lightweight image/text student models are designed to learn robust visual/multilingual textual feature representation ability from corresponding teacher models, respectively. Subsequently, the multilingual vision-language alignment stage enables effective alignment of visual and multilingual textual features to further improve the model's multilingual performance. Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context and competitive performance in the Chinese context, even with less training data, when compared to existing models of similar parameter magnitude. The evaluation demonstrates the effectiveness of our designed training mechanism.
- Published
- 2024
27. MST-GNN: graph neural network with multi-granularity in space and time for traffic prediction
- Author
-
Zhao, Xinru, Yu, Wenhao, and Zhang, Yifan
- Published
- 2024
- Full Text
- View/download PDF
28. LocoMan: Advancing Versatile Quadrupedal Dexterity with Lightweight Loco-Manipulators
- Author
-
Lin, Changyi, Liu, Xingyu, Yang, Yuxiang, Niu, Yaru, Yu, Wenhao, Zhang, Tingnan, Tan, Jie, Boots, Byron, and Zhao, Ding
- Subjects
Computer Science - Robotics - Abstract
Quadrupedal robots have emerged as versatile agents capable of locomoting and manipulating in complex environments. Traditional designs typically rely on the robot's inherent body parts or incorporate top-mounted arms for manipulation tasks. However, these configurations may limit the robot's operational dexterity, efficiency and adaptability, particularly in cluttered or constrained spaces. In this work, we present LocoMan, a dexterous quadrupedal robot with a novel morphology to perform versatile manipulation in diverse constrained environments. By equipping a Unitree Go1 robot with two low-cost and lightweight modular 3-DoF loco-manipulators on its front calves, LocoMan leverages the combined mobility and functionality of the legs and grippers for complex manipulation tasks that require precise 6D positioning of the end effector in a wide workspace. To harness the loco-manipulation capabilities of LocoMan, we introduce a unified control framework that extends the whole-body controller (WBC) to integrate the dynamics of loco-manipulators. Through experiments, we validate that the proposed whole-body controller can accurately and stably follow desired 6D trajectories of the end effector and torso, which, when combined with the large workspace from our design, facilitates a diverse set of challenging dexterous loco-manipulation tasks in confined spaces, such as opening doors, plugging into sockets, picking objects in narrow and low-lying spaces, and bimanual manipulation., Comment: Project page: https://linchangyi1.github.io/LocoMan
- Published
- 2024
29. CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
- Author
-
Sathyamoorthy, Adarsh Jagan, Weerakoon, Kasun, Elnoor, Mohamed, Zore, Anuj, Ichter, Brian, Xia, Fei, Tan, Jie, Yu, Wenhao, and Manocha, Dinesh
- Subjects
Computer Science - Robotics - Abstract
We present ConVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g. ``stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image locations in the real-world, direct the VLM's attention solely to navigable locations, and elucidate the spatial relationships between them and terrains depicted in the image to the VLM. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g. not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios., Comment: 9 pages, 4 figures
- Published
- 2024
30. TFCounter:Polishing Gems for Training-Free Object Counting
- Author
-
Ting, Pan, Lin, Jianfeng, Yu, Wenhao, Zhang, Wenlong, Chen, Xiaoying, Zhang, Jinlu, and Huang, Binqiang
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,68 - Abstract
Object counting is a challenging task with broad application prospects in security surveillance, traffic management, and disease diagnosis. Existing object counting methods face a tri-fold challenge: achieving superior performance, maintaining high generalizability, and minimizing annotation costs. We develop a novel training-free class-agnostic object counter, TFCounter, which is prompt-context-aware via the cascade of the essential elements in large-scale foundation models. This approach employs an iterative counting framework with a dual prompt system to recognize a broader spectrum of objects varying in shape, appearance, and size. Besides, it introduces an innovative context-aware similarity module incorporating background context to enhance accuracy within messy scenes. To demonstrate cross-domain generalizability, we collect a novel counting dataset named BIKE-1000, including exclusive 1000 images of shared bicycles from Meituan. Extensive experiments on FSC-147, CARPK, and BIKE-1000 datasets demonstrate that TFCounter outperforms existing leading training-free methods and exhibits competitive results compared to trained counterparts., Comment: 14pages,11 figuers
- Published
- 2024
31. Towards Safe and Reliable Autonomous Driving: Dynamic Occupancy Set Prediction
- Author
-
Shao, Wenbo, Xu, Jiahui, Yu, Wenhao, Li, Jun, and Wang, Hong
- Subjects
Computer Science - Robotics ,Computer Science - Computer Vision and Pattern Recognition - Abstract
In the rapidly evolving field of autonomous driving, reliable prediction is pivotal for vehicular safety. However, trajectory predictions often deviate from actual paths, particularly in complex and challenging environments, leading to significant errors. To address this issue, our study introduces a novel method for Dynamic Occupancy Set (DOS) prediction, it effectively combines advanced trajectory prediction networks with a DOS prediction module, overcoming the shortcomings of existing models. It provides a comprehensive and adaptable framework for predicting the potential occupancy sets of traffic participants. The innovative contributions of this study include the development of a novel DOS prediction model specifically tailored for navigating complex scenarios, the introduction of precise DOS mathematical representations, and the formulation of optimized loss functions that collectively advance the safety and efficiency of autonomous systems. Through rigorous validation, our method demonstrates marked improvements over traditional models, establishing a new benchmark for safety and operational efficiency in intelligent transportation systems., Comment: Accepted by IEEE IV 2024
- Published
- 2024
32. StarCoder 2 and The Stack v2: The Next Generation
- Author
-
Lozhkov, Anton, Li, Raymond, Allal, Loubna Ben, Cassano, Federico, Lamy-Poirier, Joel, Tazi, Nouamane, Tang, Ao, Pykhtar, Dmytro, Liu, Jiawei, Wei, Yuxiang, Liu, Tianyang, Tian, Max, Kocetkov, Denis, Zucker, Arthur, Belkada, Younes, Wang, Zijian, Liu, Qian, Abulkhanov, Dmitry, Paul, Indraneil, Li, Zhuang, Li, Wen-Ding, Risdal, Megan, Li, Jia, Zhu, Jian, Zhuo, Terry Yue, Zheltonozhskii, Evgenii, Dade, Nii Osae Osae, Yu, Wenhao, Krauß, Lucas, Jain, Naman, Su, Yixuan, He, Xuanli, Dey, Manan, Abati, Edoardo, Chai, Yekun, Muennighoff, Niklas, Tang, Xiangru, Oblokulov, Muhtasham, Akiki, Christopher, Marone, Marc, Mou, Chenghao, Mishra, Mayank, Gu, Alex, Hui, Binyuan, Dao, Tri, Zebaze, Armel, Dehaene, Olivier, Patry, Nicolas, Xu, Canwen, McAuley, Julian, Hu, Han, Scholak, Torsten, Paquet, Sebastien, Robinson, Jennifer, Anderson, Carolyn Jane, Chapados, Nicolas, Patwary, Mostofa, Tajbakhsh, Nima, Jernite, Yacine, Ferrandis, Carlos Muñoz, Zhang, Lingming, Hughes, Sean, Wolf, Thomas, Guha, Arjun, von Werra, Leandro, and de Vries, Harm
- Subjects
Computer Science - Software Engineering ,Computer Science - Artificial Intelligence - Abstract
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
- Published
- 2024
33. Learning to Learn Faster from Human Feedback with Language Model Predictive Control
- Author
-
Liang, Jacky, Xia, Fei, Yu, Wenhao, Zeng, Andy, Arenas, Montserrat Gonzalez, Attarian, Maria, Bauza, Maria, Bennice, Matthew, Bewley, Alex, Dostmohamed, Adil, Fu, Chuyuan Kelly, Gileadi, Nimrod, Giustina, Marissa, Gopalakrishnan, Keerthana, Hasenclever, Leonard, Humplik, Jan, Hsu, Jasmine, Joshi, Nikhil, Jyenis, Ben, Kew, Chase, Kirmani, Sean, Lee, Tsang-Wei Edward, Lee, Kuang-Huei, Michaely, Assaf Hurwitz, Moore, Joss, Oslund, Ken, Rao, Dushyant, Ren, Allen, Tabanpour, Baruch, Vuong, Quan, Wahid, Ayzaan, Xiao, Ted, Xu, Ying, Zhuang, Vincent, Xu, Peng, Frey, Erik, Caluwaerts, Ken, Zhang, Tingnan, Ichter, Brian, Tompson, Jonathan, Takayama, Leila, Vanhoucke, Vincent, Shafran, Izhak, Mataric, Maja, Sadigh, Dorsa, Heess, Nicolas, Rao, Kanishka, Stewart, Nik, Tan, Jie, and Parada, Carolina
- Subjects
Computer Science - Robotics - Abstract
Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs, to remember their in-context interactions and improve their teachability i.e., how efficiently they adapt to human inputs (measured by average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are viewed as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions is training a transition dynamics model -- that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
- Published
- 2024
34. PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
- Author
-
Nasiriany, Soroush, Xia, Fei, Yu, Wenhao, Xiao, Ted, Liang, Jacky, Dasgupta, Ishita, Xie, Annie, Driess, Danny, Wahid, Ayzaan, Xu, Zhuo, Vuong, Quan, Zhang, Tingnan, Lee, Tsang-Wei Edward, Lee, Kuang-Huei, Xu, Peng, Kirmani, Sean, Zhu, Yuke, Zeng, Andy, Hausman, Karol, Heess, Nicolas, Finn, Chelsea, Levine, Sergey, and Ichter, Brian
- Subjects
Computer Science - Robotics ,Computer Science - Computation and Language ,Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data? In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights potentials and limitations of this new regime and shows a promising approach for Internet-Scale VLMs in robotic and spatial reasoning domains. Website: pivot-prompt.github.io and HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.
- Published
- 2024
35. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Author
-
He, Hongliang, Yao, Wenlin, Ma, Kaixin, Yu, Wenhao, Dai, Yong, Zhang, Hongming, Lan, Zhenzhong, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents., Comment: Accepted to ACL 2024 (main). Code and data is released at https://github.com/MinorJerry/WebVoyager
- Published
- 2024
36. The Evolution of China’s National Innovation System During the 40 Years of Reform and Opening Up
- Author
-
Yu Wenhao
- Subjects
reform and opening up ,science and technology (s&t) systems ,national innovation system (nis) ,independent innovation ,Social Sciences - Published
- 2019
- Full Text
- View/download PDF
37. Gradient Shaping for Multi-Constraint Safe Reinforcement Learning
- Author
-
Yao, Yihang, Liu, Zuxin, Cen, Zhepeng, Huang, Peide, Zhang, Tingnan, Yu, Wenhao, and Zhao, Ding
- Subjects
Computer Science - Machine Learning - Abstract
Online safe reinforcement learning (RL) involves training a policy that maximizes task efficiency while satisfying constraints via interacting with the environments. In this paper, our focus lies in addressing the complex challenges associated with solving multi-constraint (MC) safe RL problems. We approach the safe RL problem from the perspective of Multi-Objective Optimization (MOO) and propose a unified framework designed for MC safe RL algorithms. This framework highlights the manipulation of gradients derived from constraints. Leveraging insights from this framework and recognizing the significance of \textit{redundant} and \textit{conflicting} constraint conditions, we introduce the Gradient Shaping (GradS) method for general Lagrangian-based safe RL algorithms to improve the training efficiency in terms of both reward and constraint satisfaction. Our extensive experimentation demonstrates the effectiveness of our proposed method in encouraging exploration and learning a policy that improves both safety and reward performance across various challenging MC safe RL tasks as well as good scalability to the number of constraints.
- Published
- 2023
38. Dense X Retrieval: What Retrieval Granularity Should We Use?
- Author
-
Chen, Tong, Wang, Hongwei, Chen, Sihao, Yu, Wenhao, Ma, Kaixin, Zhao, Xinran, Zhang, Hongming, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Information Retrieval - Abstract
Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
- Published
- 2023
39. PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning
- Author
-
Zhang, Zhihan, Lee, Dong-Ho, Fang, Yuwei, Yu, Wenhao, Jia, Mengzhao, Jiang, Meng, and Barbieri, Francesco
- Subjects
Computer Science - Computation and Language - Abstract
Instruction tuning has remarkably advanced large language models (LLMs) in understanding and responding to diverse human instructions. Despite the success in high-resource languages, its application in lower-resource ones faces challenges due to the imbalanced foundational abilities of LLMs across different languages, stemming from the uneven language distribution in their pre-training data. To tackle this issue, we propose pivot language guided generation (PLUG), an approach that utilizes a high-resource language, primarily English, as the pivot to enhance instruction tuning in lower-resource languages. It trains the model to first process instructions in the pivot language, and then produce responses in the target language. To evaluate our approach, we introduce a benchmark, X-AlpacaEval, of instructions in 4 languages (Chinese, Korean, Italian, and Spanish), each annotated by professional translators. Our approach demonstrates a significant improvement in the instruction-following abilities of LLMs by 29% on average, compared to directly responding in the target language alone. Further experiments validate the versatility of our approach by employing alternative pivot languages beyond English to assist languages where LLMs exhibit lower proficiency. Our code and data are available at https://github.com/ytyz1307zzh/PLUG.
- Published
- 2023
40. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- Author
-
Yu, Wenhao, Zhang, Hongming, Pan, Xiaoman, Ma, Kaixin, Wang, Hongwei, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses, and potentially causing the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. In situations where knowledge is lacking, these systems should ideally respond with "unknown" when the answer is unattainable. In response to these challenges, we introduces Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. We employed ChatGPT to create training data for CoN, which was subsequently trained on an LLaMa-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope., Comment: EMNLP 2024 (main conference)
- Published
- 2023
41. Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
- Author
-
Chen, Sihao, Zhang, Hongming, Chen, Tong, Zhou, Ben, Yu, Wenhao, Yu, Dian, Peng, Baolin, Wang, Hongwei, Roth, Dan, and Yu, Dong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
- Published
- 2023
42. RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches
- Author
-
Gu, Jiayuan, Kirmani, Sean, Wohlhart, Paul, Lu, Yao, Arenas, Montserrat Gonzalez, Rao, Kanishka, Yu, Wenhao, Fu, Chuyuan, Gopalakrishnan, Keerthana, Xu, Zhuo, Sundaresan, Priya, Xu, Peng, Su, Hao, Hausman, Karol, Finn, Chelsea, Vuong, Quan, and Xiao, Ted
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data., Comment: Evaluation videos can be found at https://rt-trajectory.github.io/
- Published
- 2023
43. FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer
- Author
-
Zhang, Xinyu, Wang, Li, Jiang, Zhiqiang, Dai, Kun, Xie, Tao, Yang, Lei, Yu, Wenhao, Shen, Yang, and Li, Jun
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Local Feature Matching, an essential component of several computer vision tasks (e.g., structure from motion and visual localization), has been effectively settled by Transformer-based methods. However, these methods only integrate long-range context information among keypoints with a fixed receptive field, which constrains the network from reconciling the importance of features with different receptive fields to realize complete image perception, hence limiting the matching accuracy. In addition, these methods utilize a conventional handcrafted encoding approach to integrate the positional information of keypoints into the visual descriptors, which limits the capability of the network to extract reliable positional encoding message. In this study, we propose Feature Matching with Reconciliatory Transformer (FMRT), a novel Transformer-based detector-free method that reconciles different features with multiple receptive fields adaptively and utilizes parallel networks to realize reliable positional encoding. Specifically, FMRT proposes a dedicated Reconciliatory Transformer (RecFormer) that consists of a Global Perception Attention Layer (GPAL) to extract visual descriptors with different receptive fields and integrate global context information under various scales, Perception Weight Layer (PWL) to measure the importance of various receptive fields adaptively, and Local Perception Feed-forward Network (LPFFN) to extract deep aggregated multi-scale local feature representation. Extensive experiments demonstrate that FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
- Published
- 2023
44. PathRL: An End-to-End Path Generation Method for Collision Avoidance via Deep Reinforcement Learning
- Author
-
Yu, Wenhao, Peng, Jie, Qiu, Quecheng, Wang, Hanyu, Zhang, Lu, and Ji, Jianmin
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
Robot navigation using deep reinforcement learning (DRL) has shown great potential in improving the performance of mobile robots. Nevertheless, most existing DRL-based navigation methods primarily focus on training a policy that directly commands the robot with low-level controls, like linear and angular velocities, which leads to unstable speeds and unsmooth trajectories of the robot during the long-term execution. An alternative method is to train a DRL policy that outputs the navigation path directly. However, two roadblocks arise for training a DRL policy that outputs paths: (1) The action space for potential paths often involves higher dimensions comparing to low-level commands, which increases the difficulties of training; (2) It takes multiple time steps to track a path instead of a single time step, which requires the path to predicate the interactions of the robot w.r.t. the dynamic environment in multiple time steps. This, in turn, amplifies the challenges associated with training. In response to these challenges, we propose PathRL, a novel DRL method that trains the policy to generate the navigation path for the robot. Specifically, we employ specific action space discretization techniques and tailored state space representation methods to address the associated challenges. In our experiments, PathRL achieves better success rates and reduces angular rotation variability compared to other DRL navigation methods, facilitating stable and smooth robot movement. We demonstrate the competitive edge of PathRL in both real-world scenarios and multiple challenging simulation environments.
- Published
- 2023
45. Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
- Author
-
Zhang, Zhihan, Wang, Shuohang, Yu, Wenhao, Xu, Yichong, Iter, Dan, Zeng, Qingkai, Liu, Yang, Zhu, Chenguang, and Jiang, Meng
- Subjects
Computer Science - Computation and Language - Abstract
Large language models (LLMs) can perform a wide range of tasks by following natural language instructions, without the necessity of task-specific fine-tuning. Unfortunately, the performance of LLMs is greatly influenced by the quality of these instructions, and manually writing effective instructions for each task is a laborious and subjective process. In this paper, we introduce Auto-Instruct, a novel method to automatically improve the quality of instructions provided to LLMs. Our method leverages the inherent generative ability of LLMs to produce diverse candidate instructions for a given task, and then ranks them using a scoring model trained on a variety of 575 existing NLP tasks. In experiments on 118 out-of-domain tasks, Auto-Instruct surpasses both human-written instructions and existing baselines of LLM-generated instructions. Furthermore, our method exhibits notable generalizability even with other LLMs that are not incorporated into its training process., Comment: Accepted to EMNLP 2023 Findings. Work was done before July 2023
- Published
- 2023
46. Creative Robot Tool Use with Large Language Models
- Author
-
Xu, Mengdi, Huang, Peide, Yu, Wenhao, Liu, Shiqi, Zhang, Xilun, Niu, Yaru, Zhang, Tingnan, Xia, Fei, Tan, Jie, and Zhao, Ding
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Tool use is a hallmark of advanced intelligence, exemplified in both animal behavior and robotic capabilities. This paper investigates the feasibility of imbuing robots with the ability to creatively use tools in tasks that involve implicit physical constraints and long-term planning. Leveraging Large Language Models (LLMs), we develop RoboTool, a system that accepts natural language instructions and outputs executable code for controlling robots in both simulated and real-world environments. RoboTool incorporates four pivotal components: (i) an "Analyzer" that interprets natural language to discern key task-related concepts, (ii) a "Planner" that generates comprehensive strategies based on the language input and key concepts, (iii) a "Calculator" that computes parameters for each skill, and (iv) a "Coder" that translates these plans into executable Python code. Our results show that RoboTool can not only comprehend explicit or implicit physical constraints and environmental factors but also demonstrate creative tool use. Unlike traditional Task and Motion Planning (TAMP) methods that rely on explicit optimization, our LLM-based system offers a more flexible, efficient, and user-friendly solution for complex robotics tasks. Through extensive experiments, we validate that RoboTool is proficient in handling tasks that would otherwise be infeasible without the creative use of tools, thereby expanding the capabilities of robotic systems. Demos are available on our project page: https://creative-robotool.github.io/., Comment: 19 pages, 14 figures, 2 tables
- Published
- 2023
47. Toward Intelligent Emergency Control for Large-scale Power Systems: Convergence of Learning, Physics, Computing and Control
- Author
-
Huang, Qiuhua, Huang, Renke, Yin, Tianzhixi, Datta, Sohom, Sun, Xueqing, Hou, Jason, Tan, Jie, Yu, Wenhao, Liu, Yuan, Li, Xinya, Palmer, Bruce, Li, Ang, Ke, Xinda, Vaiman, Marianna, Wang, Song, and Chen, Yousu
- Subjects
Electrical Engineering and Systems Science - Systems and Control - Abstract
This paper has delved into the pressing need for intelligent emergency control in large-scale power systems, which are experiencing significant transformations and are operating closer to their limits with more uncertainties. Learning-based control methods are promising and have shown effectiveness for intelligent power system control. However, when they are applied to large-scale power systems, there are multifaceted challenges such as scalability, adaptiveness, and security posed by the complex power system landscape, which demand comprehensive solutions. The paper first proposes and instantiates a convergence framework for integrating power systems physics, machine learning, advanced computing, and grid control to realize intelligent grid control at a large scale. Our developed methods and platform based on the convergence framework have been applied to a large (more than 3000 buses) Texas power system, and tested with 56000 scenarios. Our work achieved a 26% reduction in load shedding on average and outperformed existing rule-based control in 99.7% of the test scenarios. The results demonstrated the potential of the proposed convergence framework and DRL-based intelligent control for the future grid., Comment: submitted to PSCC 2024
- Published
- 2023
48. Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning
- Author
-
Yao, Yihang, Liu, Zuxin, Cen, Zhepeng, Zhu, Jiacheng, Yu, Wenhao, Zhang, Tingnan, and Zhao, Ding
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Safe reinforcement learning (RL) focuses on training reward-maximizing agents subject to pre-defined safety constraints. Yet, learning versatile safe policies that can adapt to varying safety constraint requirements during deployment without retraining remains a largely unexplored and challenging area. In this work, we formulate the versatile safe RL problem and consider two primary requirements: training efficiency and zero-shot adaptation capability. To address them, we introduce the Conditioned Constrained Policy Optimization (CCPO) framework, consisting of two key modules: (1) Versatile Value Estimation (VVE) for approximating value functions under unseen threshold conditions, and (2) Conditioned Variational Inference (CVI) for encoding arbitrary constraint thresholds during policy optimization. Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance while preserving zero-shot adaptation capabilities to different constraint thresholds data-efficiently. This makes our approach suitable for real-world dynamic applications.
- Published
- 2023
49. GraphPatcher: Mitigating Degree Bias for Graph Neural Networks via Test-time Augmentation
- Author
-
Ju, Mingxuan, Zhao, Tong, Yu, Wenhao, Shah, Neil, and Ye, Yanfang
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Recent studies have shown that graph neural networks (GNNs) exhibit strong biases towards the node degree: they usually perform satisfactorily on high-degree nodes with rich neighbor information but struggle with low-degree nodes. Existing works tackle this problem by deriving either designated GNN architectures or training strategies specifically for low-degree nodes. Though effective, these approaches unintentionally create an artificial out-of-distribution scenario, where models mainly or even only observe low-degree nodes during the training, leading to a downgraded performance for high-degree nodes that GNNs originally perform well at. In light of this, we propose a test-time augmentation framework, namely GraphPatcher, to enhance test-time generalization of any GNNs on low-degree nodes. Specifically, GraphPatcher iteratively generates virtual nodes to patch artificially created low-degree nodes via corruptions, aiming at progressively reconstructing target GNN's predictions over a sequence of increasingly corrupted nodes. Through this scheme, GraphPatcher not only learns how to enhance low-degree nodes (when the neighborhoods are heavily corrupted) but also preserves the original superior performance of GNNs on high-degree nodes (when lightly corrupted). Additionally, GraphPatcher is model-agnostic and can also mitigate the degree bias for either self-supervised or supervised GNNs. Comprehensive experiments are conducted over seven benchmark datasets and GraphPatcher consistently enhances common GNNs' overall performance by up to 3.6% and low-degree performance by up to 6.5%, significantly outperforming state-of-the-art baselines. The source code is publicly available at https://github.com/jumxglhf/GraphPatcher., Comment: NeurIPS'23
- Published
- 2023
50. Prognostic and clinicopathological significance of tertiary lymphoid structure in non-small cell lung cancer: a systematic review and meta-analysis
- Author
-
Ma, Luyuan, Li, Rongyang, Liu, Xiaomeng, Yu, Wenhao, Tang, Zhanpeng, Shen, Yi, and Tian, Hui
- Published
- 2024
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.