Search Results: "Dong, Hao" (7,469 results)
2. Is Precise Recovery Necessary? A Task-Oriented Imputation Approach for Time Series Forecasting on Variable Subset
- Authors: Hao, Qi, Liang, Runchang, Gao, Yue, Dong, Hao, Fan, Wei, Jiang, Lu, and Wang, Pengyang
- Subjects: Computer Science - Machine Learning
- Abstract: Variable Subset Forecasting (VSF) refers to a unique scenario in multivariate time series forecasting, where the variables available in the inference phase are only a subset of the variables seen in the training phase. VSF presents significant challenges, as entire time series may be missing and neither inter- nor intra-variable correlations persist. Such conditions impede traditional imputation methods, which focus primarily on filling in individual missing data points. Inspired by the feature-engineering principle that not all variables contribute positively to forecasting, we propose Task-Oriented Imputation for VSF (TOI-VSF), a novel framework that shifts the focus from accurate data recovery to directly supporting the downstream forecasting task. TOI-VSF incorporates a self-supervised imputation module, agnostic to the forecasting model, designed to fill in missing variables while preserving the vital characteristics and temporal patterns of time series data. Additionally, we implement a joint learning strategy for imputation and forecasting, ensuring that the imputation process is directly aligned with and beneficial to the forecasting objective (a toy sketch of such a joint objective follows this entry). Extensive experiments across four datasets demonstrate the superiority of TOI-VSF, which outperforms baseline methods by $15\%$ on average.
- Published: 2024
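A hedged sketch of the joint imputation-plus-forecasting objective described in the entry above. Everything here (the `Imputer`/`Forecaster` split, the GRU backbone, the weight `lambda_imp`) is an illustrative assumption, not the authors' code:

```python
import torch
import torch.nn as nn

class JointImputeForecast(nn.Module):
    """Toy stand-in for TOI-VSF's idea: impute missing variables,
    then forecast, with both objectives trained jointly."""

    def __init__(self, n_vars: int, horizon: int, hidden: int = 64):
        super().__init__()
        self.imputer = nn.Sequential(
            nn.Linear(n_vars, hidden), nn.ReLU(), nn.Linear(hidden, n_vars))
        self.forecaster = nn.GRU(n_vars, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_vars * horizon)
        self.horizon, self.n_vars = horizon, n_vars

    def forward(self, x, mask):
        # x: (batch, time, n_vars); mask: 1 where observed, 0 where missing
        x_hat = self.imputer(x * mask)           # propose values for all variables
        x_full = mask * x + (1 - mask) * x_hat   # keep observed entries as-is
        _, h = self.forecaster(x_full)
        y_hat = self.head(h[-1]).view(-1, self.horizon, self.n_vars)
        return x_hat, y_hat

def joint_loss(x_hat, x, mask, y_hat, y, lambda_imp=0.1):
    # Self-supervised reconstruction on observed entries only, plus the
    # forecasting loss that the imputation is meant to serve.
    imp = ((x_hat - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return ((y_hat - y) ** 2).mean() + lambda_imp * imp
```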
3. DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
- Authors: Li, Shawn, Gong, Huixian, Dong, Hao, Yang, Tiankai, Tu, Zhengzhong, and Zhao, Yue
- Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract: Out-of-distribution (OOD) detection is essential for ensuring the robustness of machine learning models by identifying samples that deviate from the training distribution. While traditional OOD detection has primarily focused on single-modality inputs, such as images, recent advances in multimodal models have demonstrated the potential of leveraging multiple modalities (e.g., video, optical flow, audio) to enhance detection performance. However, existing methods often overlook intra-class variability within in-distribution (ID) data, assuming that samples of the same class are perfectly cohesive and consistent. This assumption can lead to performance degradation, especially when prediction discrepancies are uniformly amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. Our method dynamically updates the class-center representation of each class by measuring the variance of similar samples within each batch, enabling adaptive adjustments (a toy update rule follows this entry). This approach allows us to amplify prediction discrepancies based on the updated class centers, thereby improving the model's robustness and generalization across different modalities. Extensive experiments on two tasks, five datasets, and nine base OOD algorithms demonstrate that DPU significantly improves OOD detection performance, setting a new state of the art in multimodal OOD detection, with improvements of up to 80 percent in Far-OOD detection. To facilitate accessibility and reproducibility, our code is publicly available on GitHub.
- Published: 2024
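One plausible reading of "dynamically updates class center representations by measuring the variance of similar samples within each batch", written as a toy update rule. The variance-to-momentum mapping via a sigmoid is my assumption, not the paper's formula:

```python
import torch

def update_prototypes(protos, feats, labels, base_momentum=0.9):
    """Toy DPU-style prototype update: each class center moves toward the
    batch mean of its features, taking smaller steps when the batch shows
    high intra-class variance (i.e., the batch estimate is less trusted)."""
    for c in labels.unique():
        f = feats[labels == c]                           # (n_c, dim)
        batch_mean = f.mean(dim=0)
        spread = f.var(dim=0, unbiased=False).mean()     # scalar variance proxy
        momentum = base_momentum + (1 - base_momentum) * torch.sigmoid(spread)
        protos[c] = momentum * protos[c] + (1 - momentum) * batch_mean
    return protos
```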
4. ET-SEED: Efficient Trajectory-Level SE(3) Equivariant Diffusion Policy
- Authors: Tie, Chenrui, Chen, Yue, Wu, Ruihai, Dong, Boxuan, Li, Zeyi, Gao, Chongkai, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract: Imitation learning, e.g., diffusion policy, has proven effective in various robotic manipulation tasks. However, extensive demonstrations are required for policy robustness and generalization. To reduce this reliance on demonstrations, we leverage spatial symmetry and propose ET-SEED, an efficient trajectory-level SE(3) equivariant diffusion model for generating action sequences in complex robot manipulation tasks. Previous equivariant diffusion models require per-step equivariance in the Markov process, making it difficult to learn a policy under such strong constraints. We theoretically extend equivariant Markov kernels and simplify the conditions for an equivariant diffusion process, thereby significantly improving training efficiency for trajectory-level SE(3) equivariant diffusion policies in an end-to-end manner. We evaluate ET-SEED on representative robotic manipulation tasks involving rigid, articulated, and deformable objects. Experiments demonstrate the superior data efficiency and manipulation proficiency of our proposed method, as well as its ability to generalize to unseen configurations with only a few demonstrations. Website: https://et-seed.github.io/
- Comment: Accepted to the CoRL 2024 Workshop on X-Embodiment Robot Learning
- Published: 2024
5. GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation
- Authors: Lu, Haoran, Wu, Ruihai, Li, Yitong, Li, Sijie, Zhu, Ziyu, Ning, Chuanruo, Shen, Yan, Luo, Longzan, Chen, Yuanpei, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction
- Abstract: Manipulating garments and fabrics has long been a critical endeavor in the development of home-assistant robots. However, due to complex dynamics and topological structures, garment manipulation poses significant challenges. Recent successes in reinforcement learning and vision-based methods offer promising avenues for learning garment manipulation. Nevertheless, these approaches are severely constrained by current benchmarks, which offer limited task diversity and unrealistic simulation behavior. We therefore present GarmentLab, a content-rich benchmark and realistic simulation designed for deformable-object and garment manipulation. Our benchmark encompasses a diverse range of garment types, robotic systems, and manipulators. Its abundant tasks further explore the interactions between garments, deformable objects, rigid bodies, fluids, and the human body. Moreover, by incorporating multiple simulation methods such as FEM and PBD, along with our proposed sim-to-real algorithms and real-world benchmark, we aim to significantly narrow the sim-to-real gap. We evaluate state-of-the-art vision methods, reinforcement learning, and imitation learning approaches on these tasks, highlighting the challenges faced by current algorithms, notably their limited generalization capabilities. Our proposed open-source environments and comprehensive analysis promise a boost to future research in garment manipulation by unlocking the full potential of these methods. We will open-source our code as soon as possible; the supplementary videos give further details of our work. Our project page is available at: https://garmentlab.github.io/
- Comment: NeurIPS 2024
- Published: 2024
6. Role of Wettability, Adhesion, and Instabilities in Transitions During Lubricated Sliding Friction
- Authors: Dong, Hao, Siddiquie, Reshma, Xiao, Xuemei, Andrews, Michael, Bergman, Brian, Hui, Chung-Yuen, and Jagota, Anand
- Subjects: Condensed Matter - Soft Condensed Matter, Condensed Matter - Materials Science
- Abstract: Lubricated contacts in soft materials are important in various engineering systems and natural settings. The three major lubrication regimes are boundary (BL), mixed (ML), and elasto-hydrodynamic (EHL) lubrication, in which the contact region is dry, partially wetted, or fully wetted, respectively. The transitions between these regimes are insufficiently understood, especially for soft contacts, which impedes the desired control of lubricated sliding friction. Here, we report on the role of solid wettability and adhesion in these transitions. The wettability of glycerol on a polydimethylsiloxane (PDMS) surface, and the adhesion between a glass indenter and PDMS, were varied by exposing the PDMS to an ultraviolet light-ozone (UV-Ozone) cleaner. By combining friction tests and visualization, we demonstrate that the transition from the ML to the BL regime is dominated by the wettability of the lubricant; increasing the wettability of glycerol makes removal of liquid from the contact region more difficult. The transition from EHL to ML is related to a series of events with increasing normal load: thinning of the lubricant layer; a sudden jump to contact between the glass indenter and solid substrate across a gap of tens to a few hundred nanometers; and attendant elastic instabilities such as wrinkling and stick-slip. These results provide a deeper understanding of the transitions in the lubricated frictional behavior of soft materials, which govern the maximum and minimum friction achievable.
- Published: 2024
7. TeaserGen: Generating Teasers for Long Documentaries
- Authors: Xu, Weihan, Liang, Paul Pu, Kim, Haven, McAuley, Julian, Berg-Kirkpatrick, Taylor, and Dong, Hao-Wen
- Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract: Teasers are an effective tool for promoting content in entertainment, commercial, and educational fields. However, creating an effective teaser for a long video is challenging: it requires long-range multimodal modeling of the input video while maintaining audiovisual alignment, managing scene changes, and preserving factual accuracy in the output teaser. Progress along this research direction has been hindered by the lack of a publicly available dataset. In this work, we present DocumentaryNet, a collection of 1,269 documentaries paired with their teasers, featuring multimodal data streams of video, speech, music, sound effects, and narration. With DocumentaryNet, we propose a new two-stage system for generating teasers from long documentaries. The proposed TeaserGen system first generates the teaser narration from the transcribed narration of the documentary using a pretrained large language model, and then selects the most relevant visual content to accompany the generated narration through language-vision models (a toy matching step follows this entry). For narration-video matching, we explore two approaches: a pretraining-based model using pretrained contrastive language-vision models, and a deep sequential model that learns the mapping between narration and visuals. Our experimental results show that the pretraining-based approach is more effective at identifying relevant visual content than directly trained deep autoregressive models.
- Published: 2024
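A minimal sketch of the pretraining-based narration-video matching step: score each generated narration sentence against candidate shots by cosine similarity of embeddings from a contrastive language-vision model, then pick greedily. The greedy `argmax` selection is an illustrative simplification, not necessarily TeaserGen's selection rule:

```python
import torch
import torch.nn.functional as F

def match_narration_to_shots(sent_emb: torch.Tensor, shot_emb: torch.Tensor):
    """sent_emb: (n_sentences, d) narration embeddings; shot_emb: (n_shots, d)
    shot embeddings, both assumed to come from a contrastive
    language-vision encoder. Returns the best-scoring shot per sentence."""
    sim = F.normalize(sent_emb, dim=-1) @ F.normalize(shot_emb, dim=-1).T
    return sim.argmax(dim=-1)  # greedy shot selection per narration line
```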
8. MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-object Demand-driven Navigation
- Authors: Wang, Hongcheng, Liu, Peiqi, Cai, Wenzhe, Wu, Mingdong, Qian, Zhengyu, and Dong, Hao
- Subjects: Computer Science - Robotics
- Abstract: Satisfying daily demands is a fundamental aspect of human life. With the advancement of embodied AI, robots are increasingly capable of satisfying human demands. Demand-driven navigation (DDN) is a task in which an agent must locate an object to satisfy a specified demand instruction, such as "I am thirsty." Previous studies typically assume that each demand instruction requires only one object to be fulfilled and do not consider individual preferences. However, realistic human demands may involve multiple objects. In this paper, we introduce the Multi-object Demand-driven Navigation (MO-DDN) benchmark, which addresses these nuanced aspects, including multi-object search and personal preferences, thus making the MO-DDN task more reflective of real-life scenarios than DDN. Building upon previous work, we employ the concept of "attribute" to tackle this new task. However, instead of relying solely on attribute features in an end-to-end manner as in DDN, we propose a modular method that constructs a coarse-to-fine attribute-based exploration agent (C2FAgent). Our experimental results illustrate that this coarse-to-fine exploration strategy capitalizes on the advantages of attributes at various decision-making levels, resulting in superior performance compared to baseline methods. Code and video can be found at https://sites.google.com/view/moddn.
- Comment: Accepted at NeurIPS 2024; 39 pages, 11 figures
- Published: 2024
9. Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset
- Authors: Xu, Weihan, McAuley, Julian, Berg-Kirkpatrick, Taylor, Dubnov, Shlomo, and Dong, Hao-Wen
- Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract: Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind, partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we leverage a pretrained large language model (LLM) to generate pseudo natural-language captions from the metadata (a toy prompt builder follows this entry). With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity, and other free-form music descriptors. In addition, we train a tag-conditioned system that supports the predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test, while the text-based system offers a more natural interface that allows free-form natural-language prompts.
- Published: 2024
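A toy illustration of turning score metadata into a pseudo-caption request for an LLM. The field names and prompt wording are invented for the example; the actual MetaScore schema and prompting are not reproduced here:

```python
def metadata_to_prompt(meta: dict) -> str:
    """Build a caption-generation prompt from (hypothetical) metadata fields."""
    return (
        "Write one natural-language caption for a piece of music with "
        f"composer: {meta.get('composer', 'unknown')}; "
        f"genre: {meta.get('genre', 'unknown')}; "
        f"instruments: {', '.join(meta.get('instruments', []))}; "
        f"user tags: {', '.join(meta.get('tags', []))}."
    )

# Usage with a stand-in generate() for whatever LLM is employed:
# caption = generate(metadata_to_prompt(
#     {"composer": "Chopin", "genre": "romantic",
#      "instruments": ["piano"], "tags": ["melancholic", "nocturne"]}))
```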
10. RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations
- Authors: Zhou, Kaichen, Cao, Yang, Kim, Taewhan, Zhao, Hao, Dong, Hao, Ting, Kai Ming, and Zhu, Ye
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: Recent advancements in industrial anomaly detection have been hindered by the lack of realistic datasets that accurately represent real-world conditions. Existing algorithms are often developed and evaluated using idealized datasets, which deviate significantly from real-life scenarios characterized by environmental noise and data corruption, such as fluctuating lighting conditions, variable object poses, and unstable camera positions. To address this gap, we introduce the Realistic Anomaly Detection (RAD) dataset, the first multi-view RGB-based anomaly detection dataset collected using a real robot arm, providing unique and realistic data scenarios. RAD comprises 4,765 images across 13 categories and 4 defect types, collected from more than 50 viewpoints, providing a comprehensive and realistic benchmark. This multi-viewpoint setup mirrors real-world conditions where anomalies may not be detectable from every perspective. Moreover, by sampling varying numbers of views, an algorithm's performance can be comprehensively evaluated across different viewpoints. This approach enhances the thoroughness of performance assessment and helps improve the algorithm's robustness. In addition, to support 3D multi-view reconstruction algorithms, we propose a data augmentation method that improves the accuracy of pose estimation and facilitates the reconstruction of 3D point clouds. We systematically evaluate state-of-the-art RGB-based and point-cloud-based models using RAD, identifying limitations and future research directions. The code and dataset can be found at https://github.com/kaichen-z/RAD
- Published: 2024
11. Canonical Representation and Force-Based Pretraining of 3D Tactile for Dexterous Visuo-Tactile Policy Learning
- Authors: Wu, Tianhao, Li, Jinzhou, Zhang, Jiyao, Wu, Mingdong, and Dong, Hao
- Subjects: Computer Science - Robotics
- Abstract:
Tactile sensing plays a vital role in enabling robots to perform fine-grained, contact-rich tasks. However, the high dimensionality of tactile data, due to the large coverage on dexterous hands, poses significant challenges for effective tactile feature learning, especially for 3D tactile data, as there are no large standardized datasets and no strong pretrained backbones. To address these challenges, we propose a novel canonical representation that reduces the difficulty of 3D tactile feature learning and further introduces a force-based self-supervised pretraining task to capture both local and net force features, which are crucial for dexterous manipulation. Our method achieves an average success rate of 78% across four fine-grained, contact-rich dexterous manipulation tasks in real-world experiments, demonstrating effectiveness and robustness compared to other methods. Further analysis shows that our method fully utilizes both spatial and force information from 3D tactile data to accomplish the tasks. The videos can be viewed at https://3dtacdex.github.io.
- Published: 2024
12. ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning
- Authors: Kim, Daewoong, Dong, Hao-Wen, and Jeong, Dasaem
- Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
- Abstract: Modeling the natural contour of the fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch-bend information, and the second stage generates a mel spectrogram incorporating these expressive details. Quantitative metrics and listening-test results show that the proposed model generates more realistic violin sounds than the model without explicit pitch-bend modeling. Audio samples are available online: daewoung.github.io/ViolinDiff-Demo.
- Published: 2024
13. Topological flat bands in hyperbolic lattices
- Authors: Guan, Dong-Hao, Qi, Lu, Zhou, Yuan, He, Ai-Lei, and Wang, Yi-Fei
- Subjects: Condensed Matter - Strongly Correlated Electrons
- Abstract: Topological flat bands (TFBs) provide a promising platform to investigate intriguing fractionalization phenomena, such as fractional Chern insulators (FCIs). Most TFB models are established on two-dimensional Euclidean lattices with zero curvature. In this work, we systematically explore TFBs in a class of two-dimensional non-Euclidean lattices with constant negative curvature, i.e., the hyperbolic analogs of the kagome lattice. Based on the Abelian hyperbolic band theory, TFBs are found in the heptagon-kagome, octagon-kagome, nonagon-kagome, and decagon-kagome lattices by introducing staggered magnetic fluxes and next-nearest-neighbor hoppings. The flatness ratios of all hyperbolic TFB models exceed 15 (a sketch of this figure of merit follows this entry), which suggests that hyperbolic FCIs can be realized in these TFB models. We further demonstrate the existence of a $\nu=1/2$ FCI state with open boundary conditions when hard-core bosons fill these hyperbolic TFB models.
- Comment: 7 pages, 5 figures, comments are welcome
- Published: 2024
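The flatness ratio quoted above is, under the usual convention (assumed here to match the paper's), the gap to the neighboring band divided by the flat band's own bandwidth:

```python
import numpy as np

def flatness_ratio(flat_band: np.ndarray, next_band: np.ndarray) -> float:
    """Standard flat-band figure of merit: (indirect gap to the band above)
    divided by (the flat band's bandwidth), given band energies sampled
    over momenta. Values > 15 indicate a very flat, well-isolated band."""
    width = flat_band.max() - flat_band.min()   # bandwidth W
    gap = next_band.min() - flat_band.max()     # indirect band gap
    return gap / width
```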
14. CoSEC: A Coaxial Stereo Event Camera Dataset for Autonomous Driving
- Authors: Peng, Shihan, Zhou, Hanyu, Dong, Hao, Shi, Zhiwei, Liu, Haoyue, Duan, Yuxing, Chang, Yi, and Yan, Luxin
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: The conventional frame camera is the mainstream sensor for autonomous driving scene perception, but it is limited in adverse conditions such as low light. Event cameras with high dynamic range have been applied to assist frame cameras in multimodal fusion, which relies heavily on pixel-level spatial alignment between the modalities. Typically, existing multimodal datasets place event and frame cameras in parallel and align them spatially via a warping operation. However, this parallel strategy is less effective for multimodal fusion, since the large disparity induced by the large event-frame baseline exacerbates spatial misalignment. We argue that minimizing the baseline can reduce alignment error between event and frame cameras. In this work, we introduce hybrid coaxial event-frame devices to build a multimodal system, and propose the Coaxial Stereo Event Camera (CoSEC) dataset for autonomous driving. For the multimodal system, we first use a microcontroller to achieve time synchronization, and then spatially calibrate the different sensors, performing intra- and inter-calibration of the stereo coaxial devices. For the multimodal dataset, we filter LiDAR point clouds to generate depth and optical flow labels using reference depth, which is further improved by fusing aligned event and frame data in nighttime conditions. With the help of the coaxial device, the proposed dataset can promote all-day pixel-level multimodal fusion. Moreover, we conduct experiments demonstrating that the proposed dataset can improve the performance and generalization of multimodal fusion.
- Comment: This work has been submitted to the IEEE for possible publication
- Published: 2024
15. EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning
- Authors: Chen, Yue, Tie, Chenrui, Wu, Ruihai, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract: Humans perceive and interact with the world with an awareness of equivariance, which facilitates manipulating different objects in diverse poses. Such equivariance also exists in many robotic manipulation scenarios. For example, whatever the pose of a drawer (translation, rotation, and tilt), the manipulation strategy is consistent (grasp the handle and pull in a line). Traditional models usually lack this awareness of equivariance for robotic manipulation, which can require more training data and yield poor performance on novel object poses. We therefore propose the EqvAfford framework, with novel designs that guarantee equivariance in point-level affordance learning for downstream robotic manipulation, showing strong performance and generalization on representative tasks with objects in diverse poses.
- Comment: Accepted to the CVPR Workshop on Equivariant Vision: From Theory to Practice 2024
- Published: 2024
16. Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation
- Authors: Ryu, Jiwoo, Dong, Hao-Wen, Jung, Jongmin, and Jeong, Dasaem
- Subjects: Computer Science - Sound, Computer Science - Information Retrieval, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract: Representing symbolic music with compound tokens, where each token consists of several sub-tokens representing distinct musical features or attributes, offers the advantage of reduced sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results, as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: a main decoder that models the sequence of compound tokens, and a sub-decoder that models the sub-tokens of each compound token (a toy nested decoder follows this entry). Experimental results show that applying the NMT to compound tokens improves perplexity on various symbolic music datasets and on discrete audio tokens from the MAESTRO dataset.
- Comment: Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
- Published: 2024
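A toy nested decoder in the spirit of the main-decoder/sub-decoder split described above. Sizes, the GRU cells (standing in for transformers), and teacher forcing are illustrative assumptions, not the NMT configuration:

```python
import torch
import torch.nn as nn

class NestedDecoder(nn.Module):
    """Main decoder summarizes the compound-token sequence; a small
    sub-decoder then emits that step's sub-tokens one at a time
    instead of predicting them all simultaneously."""

    def __init__(self, n_subtokens: int, vocab: int, d: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.main = nn.GRU(d * n_subtokens, d, batch_first=True)
        self.sub = nn.GRUCell(d, d)
        self.out = nn.Linear(d, vocab)
        self.n_subtokens = n_subtokens

    def forward(self, tokens):
        # tokens: (batch, time, n_subtokens) integer sub-token ids
        b, t, s = tokens.shape
        x = self.embed(tokens).reshape(b, t, -1)  # concatenated sub-embeddings
        ctx, _ = self.main(x)                     # per-step context (b, t, d)
        h = ctx.reshape(b * t, -1)
        inp = torch.zeros_like(h)                 # start state for sub-decoder
        logits = []
        for i in range(s):                        # decode sub-tokens in order
            h = self.sub(inp, h)
            logits.append(self.out(h))
            inp = self.embed(tokens[..., i]).reshape(b * t, -1)  # teacher forcing
        return torch.stack(logits, dim=1).reshape(b, t, s, -1)
```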
17. Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
- Authors: Wu, Junda, Novack, Zachary, Namburi, Amit, Dai, Jiaheng, Dong, Hao-Wen, Xie, Zhouhang, Chen, Carol, and McAuley, Julian
- Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract: Existing music captioning methods are limited to generating concise global descriptions of short music clips, failing to capture fine-grained musical characteristics and time-aware musical changes. To address these limitations, we propose FUTGA, a model equipped with fine-grained music understanding capabilities through learning from generative augmentation with temporal compositions. We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs. Augmented by the proposed synthetic dataset, FUTGA can identify the music's temporal changes at key transition points and their musical functions, as well as generate detailed descriptions for each music segment. We further introduce a full-length music caption dataset generated by FUTGA as an augmentation of the MusicCaps and Song Describer datasets. We evaluate the automatically generated captions on several downstream tasks, including music generation and retrieval. The experiments demonstrate the quality of the generated captions and the better downstream performance achieved by the proposed music captioning approach. Our code and datasets can be found at https://huggingface.co/JoshuaW1997/FUTGA.
- Comment: 6 pages
- Published: 2024
18. Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
- Authors: Ma, Kangqi, Dong, Hao, and Mu, Yadong
- Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
- Abstract: This paper addresses the challenge of robotic grasping of general objects. As in prior research, the task reads a single-view 3D observation (i.e., a point cloud) captured by a depth camera as input. Crucially, successful object grasping demands a comprehensive understanding of the shapes of objects within the scene. However, single-view observations often suffer from occlusions (both self- and inter-object occlusions), which lead to gaps in the point clouds, especially in complex cluttered scenes. This renders perception of object shape incomplete and frequently causes failures or inaccurate pose estimation during grasping. In this paper, we tackle this issue with an effective albeit simple solution: completing grasping-related scene regions through local occupancy prediction. Following prior practice, the proposed model first proposes a number of the most likely grasp points in the scene. Around each grasp point, a module is designed to infer whether each voxel in its neighborhood is void or occupied by some object (a toy voxelization step follows this entry). Importantly, the occupancy map is inferred by fusing both local and global cues. We implement a multi-group tri-plane scheme for efficiently aggregating long-distance contextual information. The model further estimates 6-DoF grasp poses utilizing the local occupancy-enhanced object shape information and returns the top-ranked grasp proposal. Comprehensive experiments on both the large-scale GraspNet-1Billion benchmark and a real robotic arm demonstrate that the proposed method can effectively complete the unobserved parts in cluttered and occluded scenes. Benefiting from the occupancy-enhanced features, our model clearly outstrips other competing methods under various performance metrics, such as grasping average precision.
- Published: 2024
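A toy version of the voxelization behind the local occupancy idea: build an occupancy grid in a cube around one grasp point from observed points. The paper's model *predicts* occupancy (including unobserved voxels) from fused local and global features; this sketch only constructs the observed-occupancy grid, with cube size and resolution chosen arbitrarily:

```python
import numpy as np

def local_occupancy_grid(points, grasp_point, half_extent=0.08, res=16):
    """Mark voxels of a (res x res x res) cube centered at grasp_point
    that contain at least one observed point. points: (n, 3) array."""
    lo = grasp_point - half_extent
    idx = np.floor((points - lo) / (2 * half_extent) * res).astype(int)
    inside = np.all((idx >= 0) & (idx < res), axis=1)   # keep in-cube points
    grid = np.zeros((res, res, res), dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid
```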
19. TARGO: Benchmarking Target-driven Object Grasping under Occlusions
- Authors: Xia, Yan, Ding, Ran, Qin, Ziyuan, Zhan, Guanqi, Zhou, Kaichen, Yang, Long, Dong, Hao, and Cremers, Daniel
- Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
- Abstract: Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object's grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contributions: 1) We are the first to study the occlusion level of grasping. 2) We set up an evaluation benchmark consisting of large-scale synthetic data and a portion of real-world data, on which we evaluated five grasp models and found that even the current SOTA model suffers as the occlusion level increases, leaving grasping under occlusion an open challenge. 3) We generate a large-scale training dataset via a scalable pipeline, which can be used to boost the performance of grasping under occlusion and to generalize to the real world. 4) We further propose a transformer-based grasping model with a shape completion module, termed TARGO-Net, which performs most robustly as occlusion increases. Our benchmark dataset can be found at https://TARGO-benchmark.github.io/.
- Comment: 19 pages, 17 figures
- Published: 2024
20. Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
- Authors: Dong, Hao, Chatzi, Eleni, and Fink, Olga
- Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract: The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy-weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at https://github.com/donghao51/MOOSA.
- Comment: Accepted by ECCV 2024; code: https://github.com/donghao51/MOOSA
- Published: 2024
21. Human-centered In-building Embodied Delivery Benchmark
- Authors: Xu, Zhuoqun, Liu, Yang, Li, Xiaoqi, Zhang, Jiyao, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
- Abstract: Recently, the concept of embodied intelligence has been widely accepted and popularized, leading people to naturally consider its potential for commercialization. In this work, we propose a specific commercial scenario simulation: human-centered in-building embodied delivery. For this scenario, we have developed a brand-new virtual environment system from scratch, constructing a multi-level connected building space modeled after a polar research station. The environment also includes autonomous human characters and robots with grasping and mobility capabilities, as well as a large number of interactive items. Based on this environment, we have built a delivery dataset containing 13k language instructions to guide robots in providing services. We simulate human behavior through the human characters and sample their various daily needs. Finally, we propose a method centered on a large multimodal model to serve as the baseline system for this dataset. Compared to past embodied-data work, our work focuses on a virtual environment centered on human-robot interaction in commercial scenarios. We believe this will bring new perspectives and angles of exploration to the embodied community.
- Published: 2024
22. Make Graph Neural Networks Great Again: A Generic Integration Paradigm of Topology-Free Patterns for Traffic Speed Prediction
- Authors: Zhou, Yicheng, Wang, Pengfei, Dong, Hao, Zhang, Denghui, Yang, Dingqi, Fu, Yanjie, and Wang, Pengyang
- Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract: Urban traffic speed prediction aims to estimate future traffic speed to improve urban transportation services. Enormous efforts have been made to exploit Graph Neural Networks (GNNs) for modeling the spatial correlations and temporal dependencies of traffic speed patterns, regularized by graph topology. While achieving promising results, current traffic speed prediction methods still suffer from ignoring topology-free patterns, which cannot be captured by GNNs. To tackle this challenge, we propose a generic model that enables current GNN-based methods to preserve topology-free patterns. Specifically, we first develop a Dual Cross-Scale Transformer (DCST) architecture, including a Spatial Transformer and a Temporal Transformer, to preserve cross-scale topology-free patterns and their associated dynamics. Then, to integrate both topology-regularized and topology-free patterns, we propose a distillation-style learning framework, in which the existing GNN-based method serves as the teacher model and the proposed DCST architecture serves as the student model. The teacher model injects the learned topology-regularized patterns into the student model, which integrates them with topology-free patterns (a toy objective follows this entry). Extensive experimental results demonstrate the effectiveness of our methods.
- Comment: Accepted to IJCAI 2024
- Published: 2024
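A hedged sketch of the distillation-style objective: the student (DCST) fits the ground truth while also absorbing the GNN teacher's topology-regularized predictions. The plain-MSE form and the weight `alpha` are illustrative assumptions, not the paper's loss:

```python
import torch.nn.functional as F

def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    supervised = F.mse_loss(student_pred, target)              # learn from data
    distill = F.mse_loss(student_pred, teacher_pred.detach())  # absorb teacher
    return supervised + alpha * distill
```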
23. SpatialBot: Precise Spatial Understanding with Vision Language Models
- Authors: Cai, Wenxiao, Ponomarenko, Iaroslav, Yuan, Jianhao, Li, Xiaoqi, Yang, Wankou, Dong, Hao, and Zhao, Bo
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding in both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, on general VLM benchmarks, and on embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.
- Published: 2024
24. AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
- Authors: Xiong, Chuyan, Shen, Chengyu, Li, Xiaoqi, Zhou, Kaichen, Liu, Jiaming, Wang, Ruiping, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
- Abstract: The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, making limited use of failed samples to correct low-level contact poses, errors that are particularly prone to occur during articulated-object manipulation. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which uses previous low-level interaction experience to correct SE(3) pose predictions for articulated objects. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback-prompt comprehension abilities. We design two types of prompt instructions for interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test-Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that AIC MLLM can efficiently correct failure samples by leveraging interaction-experience prompts. Our project website is https://sites.google.com/view/aic-mllm.
- Published: 2024
25. A3VLM: Actionable Articulation-Aware Vision Language Model
- Authors: Huang, Siyuan, Chang, Haonan, Liu, Yuhan, Zhu, Yimeng, Dong, Hao, Gao, Peng, Boularias, Abdeslam, and Li, Hongsheng
- Subjects: Computer Science - Robotics
- Abstract:
Vision Language Models (VLMs) have received significant attention in recent years in the robotics community. VLMs are shown to be able to perform complex visual reasoning and scene understanding tasks, which makes them regarded as a potential universal solution for general robotics problems such as manipulation and navigation. However, previous VLMs for robotics such as RT-1, RT-2, and ManipLLM have focused on directly learning robot-centric actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM. We release our code and other materials at https://github.com/changhaonan/A3VLM.
- Published: 2024
26. GFPack++: Improving 2D Irregular Packing by Learning Gradient Field with Attention
- Authors: Xue, Tianyang, Lu, Lin, Liu, Yang, Wu, Mingdong, Dong, Hao, Zhang, Yanbin, Han, Renmin, and Chen, Baoquan
- Subjects: Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Machine Learning
- Abstract: 2D irregular packing is a classic combinatorial optimization problem with various applications, such as material utilization and texture atlas generation. This NP-hard problem requires efficient algorithms to optimize space utilization. Conventional numerical methods suffer from slow convergence and high computational cost. Existing learning-based methods, such as the score-based diffusion model, also have limitations: no rotation support, frequent collisions, poor adaptability to arbitrary boundaries, and slow inference. The difficulty of learning from teacher packings lies in capturing the complex geometric relationships among packing examples, which include the spatial (position, orientation) relationships of objects, their geometric features, and container boundary conditions; representing these relationships in latent space is challenging. We propose GFPack++, an attention-based gradient-field learning approach that addresses this challenge. It consists of two pivotal strategies: attention-based geometry encoding for effective feature encoding, and attention-based relation encoding for learning complex relationships. We investigate the utilization distribution between the teacher and inference data and design a weighting function that prioritizes tighter teacher data during training, enhancing learning effectiveness. Our diffusion model supports continuous rotation and outperforms existing methods on various datasets. We achieve higher space utilization than several widely used baselines, run one order of magnitude faster than the previous diffusion-based method, and show promising generalization to arbitrary boundaries. We plan to release our source code and datasets to support further research in this direction.
- Published: 2024
27. InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment
- Authors: Long, Yuxing, Cai, Wenzhe, Wang, Hongcheng, Zhan, Guanqi, and Dong, Hao
- Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
- Abstract: Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies. The scarcity of instruction-navigation data hinders training an instruction-navigation model with varied strategies; previous methods are therefore constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model key elements in instruction navigation, so that linguistic DCoN planning can be converted into robot-actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-trained methods. InstructNav also surpasses the previous SOTA method by 10.48% on zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation (DDN). Real-robot experiments on diverse indoor scenes further demonstrate our method's robustness in coping with environment and instruction variations.
- Comment: Submitted to CoRL 2024
- Published: 2024
28. Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
- Authors: Zhang, Jiyao, Huang, Weiyao, Peng, Bo, Wu, Mingdong, Hu, Fei, Chen, Zijian, Zhao, Bo, and Dong, Hao
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: 6D object pose estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images annotated with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed-reality setting with depth simulation, annotated with over 5M annotations across 4,162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to its substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: semantic-aware feature extraction and clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset for 6D object pose estimation and pose tracking.
- Published: 2024
29. Broadcasting Support Relations Recursively from Local Dynamics for Object Retrieval in Clutters
- Authors: Li, Yitong, Wu, Ruihai, Lu, Haoran, Ning, Chuanruo, Shen, Yan, Zhan, Guanqi, and Dong, Hao
- Subjects: Computer Science - Robotics
- Abstract: In our daily lives, cluttered objects are everywhere, from scattered stationery and books on the table to bowls and plates filling the kitchen sink. Retrieving a target object from clutter is an essential yet challenging skill for robots, owing to the difficulty of safely manipulating an object without disturbing others: the robot must plan a manipulation sequence and first move away, step by step, the objects supported by the target. However, due to the diversity of object configurations (e.g., categories, geometries, locations, and poses) and their combinations in clutter, it is difficult for a robot to accurately infer the support relations between faraway objects with various objects in between. In this paper, we study retrieving objects from complicated clutter via a novel method that recursively broadcasts accurate local dynamics to build a support relation graph of the whole scene, which greatly reduces the complexity of support relation inference and improves accuracy. Experiments in both simulation and the real world demonstrate the efficiency and effectiveness of our method.
- Comment: RSS 2024
- Published: 2024
30. Learning Manipulation by Predicting Interaction
- Authors: Zeng, Jia, Bu, Qingwen, Wang, Bangjun, Xia, Wenke, Chen, Li, Dong, Hao, Song, Haoming, Wang, Dong, Hu, Di, Luo, Ping, Cui, Heming, Zhao, Bin, Li, Xuelong, Qiao, Yu, and Li, Hongyang
- Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
- Abstract: Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical interaction during manipulation, resulting in an inadequate understanding of the relationship between objects and the environment. To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively. These two learning objectives achieve superior comprehension of "how-to-interact" and "where-to-interact". We conduct a comprehensive evaluation on several challenging robotic tasks. The experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms as well as in simulation environments. Code and checkpoints are publicly shared at https://github.com/OpenDriveLab/MPI.
- Comment: Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI
- Published: 2024
31. MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities
- Authors: Dong, Hao, Zhao, Yue, Chatzi, Eleni, and Fink, Olga
- Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract: Detecting out-of-distribution (OOD) samples is important for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. Existing research has mainly focused on unimodal scenarios on image data. However, real-world applications are inherently multimodal, making it essential to leverage information from multiple modalities to enhance the efficacy of OOD detection. To establish a foundation for more realistic multimodal OOD detection, we introduce the first-of-its-kind benchmark, MultiOOD, characterized by diverse dataset sizes and varying modality combinations. We first evaluate existing unimodal OOD detection algorithms on MultiOOD, observing that the mere inclusion of additional modalities yields substantial improvements. This underscores the importance of utilizing multiple modalities for OOD detection. Based on the observation of Modality Prediction Discrepancy between in-distribution (ID) and OOD data, and its strong correlation with OOD performance, we propose the Agree-to-Disagree (A2D) algorithm to encourage such discrepancy during training (a toy discrepancy score follows this entry). Moreover, we introduce a novel outlier synthesis method, NP-Mix, which explores broader feature spaces by leveraging information from nearest-neighbor classes and complements A2D to strengthen OOD detection performance. Extensive experiments on MultiOOD demonstrate that training with A2D and NP-Mix improves existing OOD detection algorithms by a large margin. Our source code and the MultiOOD benchmark are available at https://github.com/donghao51/MultiOOD.
- Comment: NeurIPS 2024 spotlight. Code and MultiOOD benchmark: https://github.com/donghao51/MultiOOD
- Published: 2024
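A toy version of the Modality Prediction Discrepancy at the heart of MultiOOD/A2D: measure the disagreement between two modalities' class predictions, with larger disagreement indicating more OOD-like inputs. Symmetric KL is an illustrative choice of discrepancy, not necessarily the paper's exact measure:

```python
import torch.nn.functional as F

def modality_prediction_discrepancy(logits_a, logits_b):
    """Disagreement between two modalities' predictive distributions."""
    log_p, log_q = F.log_softmax(logits_a, -1), F.log_softmax(logits_b, -1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```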
32. UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence
- Authors: Wu, Ruihai, Lu, Haoran, Wang, Yiyan, Wang, Yubo, and Dong, Hao
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: Garment manipulation (e.g., unfolding, folding, and hanging clothes) is essential for future robots to accomplish home-assistant tasks, yet is highly challenging due to the diversity of garment configurations, geometries, and deformations. Although able to manipulate similarly shaped garments in a given task, previous works mostly design different policies for different tasks, do not generalize to garments with diverse geometries, and often rely heavily on human-annotated data. In this paper, we leverage the property that garments in a given category have similar structures, and learn topological dense (point-level) visual correspondence among garments of that category, under different deformations, in a self-supervised manner. The topological correspondence can easily be adapted to functional correspondence to guide manipulation policies for various downstream tasks, with only one- or few-shot demonstrations. Experiments on garments from 3 categories across 3 representative tasks in diverse scenarios, using one or two arms, taking one or more steps, and inputting flat or messy garments, demonstrate the effectiveness of our proposed method. Project page: https://warshallrho.github.io/unigarmentmanip.
- Comment: CVPR 2024
- Published: 2024
33. Invisible and Semi-invisible Decays of Bottom Baryons
- Authors: Zheng, Yong, Ding, Jian-Nan, Li, Dong-Hao, Li, Lei-Yi, Lü, Cai-Dian, and Yu, Fu-Sheng
- Subjects: High Energy Physics - Phenomenology, High Energy Physics - Experiment
- Abstract: The similar densities of dark matter and baryons in the universe imply that they might arise from the same ultraviolet model. B-Mesogenesis, which assumes dark matter is charged under baryon number, attempts to simultaneously explain the origin of the baryon asymmetry and of dark matter in the universe. In particular, B-Mesogenesis might induce bottom-baryon decays into invisible or semi-invisible final states, which provide a distinctive signal for probing this scenario. In this work, we systematically study invisible decays of bottom baryons into dark matter, and semi-invisible decays of bottom baryons into a meson or a photon together with a dark matter particle. In particular, the fully invisible decay can probe the stable particles in B-Mesogenesis. QCD-based frameworks are used to calculate the hadronic matrix elements under the B-Mesogenesis model. We estimate the constraints on the Wilson coefficients, or on the product of new-physics couplings with the Wilson coefficients, from semi-invisible and invisible decays of bottom baryons at future colliders.
- Comment: 25 pages, 7 figures
- Published: 2024
34. No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation
- Authors: Zhu, Xiangyang, Zhang, Renrui, He, Bowei, Guo, Ziyu, Liu, Jiaming, Xiao, Han, Fu, Chaoyou, Dong, Hao, and Gao, Peng
- Subjects: Computer Science - Computer Vision and Pattern Recognition
- Abstract: To reduce reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes and then evaluate their generalization performance on 'unseen' classes. However, the pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' classes. To tackle these issues, we propose a Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its parametric variant, Seg-PN. Without training, Seg-NN extracts dense representations with hand-crafted filters and achieves performance comparable to existing parametric models. By eliminating pre-training, Seg-NN alleviates the domain-gap issue and saves a substantial amount of time. Building on Seg-NN, Seg-PN requires training only a lightweight QUEry-Support Transferring (QUEST) module, which enhances the interaction between the support set and the query set. Experiments show that Seg-PN outperforms the previous state-of-the-art method by +4.19% and +7.71% mIoU on the S3DIS and ScanNet datasets, respectively, while reducing training time by 90%, indicating its effectiveness and efficiency.
- Comment: CVPR Highlight. Code is available at https://github.com/yangyangyang127/Seg-NN. arXiv admin note: text overlap with arXiv:2308.12961
- Published: 2024
35. PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments
- Author
-
Ding, Kairui, Chen, Boyuan, Wu, Ruihai, Li, Yuyang, Zhang, Zongzheng, Gao, Huan-ang, Li, Siqi, Zhou, Guyue, Zhu, Yixin, Dong, Hao, and Zhao, Hao
- Subjects
Computer Science - Robotics ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Robotic manipulation with two-finger grippers is challenged by objects lacking distinct graspable features. Traditional pre-grasping methods, which typically involve repositioning objects or utilizing external aids like table edges, are limited in their adaptability across different object categories and environments. To overcome these limitations, we introduce PreAfford, a novel pre-grasping planning framework incorporating a point-level affordance representation and a relay training approach. Our method significantly improves adaptability, allowing effective manipulation across a wide range of environments and object types. When evaluated on the ShapeNet-v2 dataset, PreAfford not only enhances grasping success rates by 69% but also demonstrates its practicality through successful real-world experiments. These improvements highlight PreAfford's potential to redefine standards for robotic handling of complex manipulation tasks in diverse settings., Comment: Project Page: https://air-discover.github.io/PreAfford/
- Published
- 2024
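To make "point-level affordance representation" concrete, here is a minimal, hypothetical sketch of how a downstream planner could consume per-point affordance scores; PreAfford's actual networks, training scheme, and action parameterization are not reproduced, and all names here are illustrative.

```python
import numpy as np

def select_pregrasp_point(points: np.ndarray, scores: np.ndarray,
                          top_k: int = 16) -> np.ndarray:
    """Given a point cloud and per-point affordance scores in [0, 1],
    return a robust pre-grasp interaction point: the centroid of the
    top-k highest-scoring points (averaging suppresses score noise).
    Illustrative stand-in for a learned affordance consumer."""
    assert points.shape[0] == scores.shape[0]
    idx = np.argsort(scores)[-top_k:]   # indices of the k best points
    return points[idx].mean(axis=0)     # (3,) target point, e.g. for pushing

# Toy usage with random data standing in for a learned affordance network.
pts = np.random.randn(2048, 3)
aff = np.random.rand(2048)
print(select_pregrasp_point(pts, aff))
```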
36. RoboKeyGen: Robot Pose and Joint Angles Estimation via Diffusion-based 3D Keypoint Generation
- Author
-
Tian, Yang, Zhang, Jiyao, Huang, Guowei, Wang, Bin, Wang, Ping, Pang, Jiangmiao, and Dong, Hao
- Subjects
Computer Science - Robotics - Abstract
Estimating robot pose and joint angles is significant in advanced robotics, enabling applications like robot collaboration and online hand-eye calibration. However, the introduction of unknown joint angles makes prediction more complex than simple robot pose estimation, due to its higher dimensionality. Previous methods either regress 3D keypoints directly or utilise a render&compare strategy. These approaches often falter in terms of performance or efficiency and grapple with the cross-camera gap problem. This paper presents a novel framework that bifurcates the high-dimensional prediction task into two manageable subtasks: 2D keypoint detection and lifting 2D keypoints to 3D. This separation promises enhanced performance without sacrificing the efficiency innate to keypoint-based techniques. A vital component of our method is the lifting of 2D keypoints to 3D keypoints. Common deterministic regression methods may falter when faced with uncertainties from 2D detection errors or self-occlusions. Leveraging the robust modeling potential of diffusion models, we reframe this issue as a conditional 3D keypoint generation task. To bolster cross-camera adaptability, we introduce the Normalised Camera Coordinate Space (NCCS), ensuring alignment of estimated 2D keypoints across varying camera intrinsics (a sketch of this normalisation follows this entry). Experimental results demonstrate that the proposed method outperforms the state-of-the-art render&compare method and achieves higher inference speed. Furthermore, the tests accentuate our method's robust cross-camera generalisation capabilities. We intend to release both the dataset and code at https://nimolty.github.io/Robokeygen/, Comment: Accepted by ICRA 2024
- Published
- 2024
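The idea of aligning 2D keypoints across camera intrinsics can be illustrated with the standard inverse-intrinsics mapping; the paper's exact definition of NCCS may differ, so treat this as a minimal sketch under that assumption.

```python
import numpy as np

def normalise_keypoints(uv: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Map pixel keypoints into an intrinsics-free space by applying the
    inverse intrinsic matrix: (u, v, 1) -> K^{-1} (u, v, 1).
    (Illustrative sketch; not necessarily the paper's exact NCCS.)

    uv: (N, 2) pixel coordinates; K: (3, 3) intrinsic matrix.
    Returns (N, 2) normalised image-plane coordinates.
    """
    homog = np.hstack([uv, np.ones((uv.shape[0], 1))])  # (N, 3) homogeneous
    rays = (np.linalg.inv(K) @ homog.T).T               # (N, 3) rays, z = 1
    return rays[:, :2]

# Toy usage with a made-up intrinsic matrix.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
kps = np.array([[320.0, 240.0], [380.0, 300.0]])
print(normalise_keypoints(kps, K))  # [[0. 0.] [0.1 0.1]]
```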
37. SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network
- Author
-
Wan, Yuxuan, Zhou, Kaichen, Chen, Jinhong, and Dong, Hao
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
Autonomous assembly in robotics and 3D vision presents significant challenges, particularly in ensuring assembly correctness. Predominant methods such as MEPNet assemble components based on manually provided images, but they often fall short on tasks requiring long-term planning. We observe that integrating a self-correction module can partially alleviate such issues. Motivated by this observation, we introduce the Single-Step Assembly Error Correction Task, which involves identifying and rectifying misassembled components. To support research in this area, we present the LEGO Error Correction Assembly Dataset (LEGO-ECA), comprising manual images for assembly steps and instances of assembly failures. Additionally, we propose the Self-Correct Assembly Network (SCANet), a novel method for this task. SCANet treats assembled components as queries, determining their correctness against the manual images and providing corrections when necessary (the generic query-check-correct pattern is sketched after this entry). Finally, we use SCANet to correct the assembly results of MEPNet. Experimental results demonstrate that SCANet can identify and correct MEPNet's misassembled results, significantly improving assembly correctness. Our code and dataset can be found at https://scanet-iros2024.github.io/.
- Published
- 2024
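The query-check-correct pattern described above can be summarized in a few lines; the predicates below stand in for SCANet's learned correctness and correction heads and are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Component:
    component_id: int
    pose: tuple  # simplified (x, y, z, rotation) placeholder

def self_correct(components, is_correct, propose_fix):
    """Generic self-correction loop: treat each assembled component as a
    query, check it against the manual image, and replace incorrect poses
    with the corrector's proposal. Both callables are illustrative stand-ins
    for learned network heads."""
    return [c if is_correct(c) else propose_fix(c) for c in components]

# Toy usage: flag every component with an odd id and reset its pose.
parts = [Component(i, (i, 0, 0, 0)) for i in range(4)]
fixed = self_correct(
    parts,
    is_correct=lambda c: c.component_id % 2 == 0,
    propose_fix=lambda c: Component(c.component_id, (0, 0, 0, 0)),
)
print([c.pose for c in fixed])
```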
38. Dexterous Functional Pre-Grasp Manipulation with Diffusion Policy
- Author
-
Wu, Tianhao, Gan, Yunchong, Wu, Mingdong, Cheng, Jingbo, Yang, Yaodong, Zhu, Yixin, and Dong, Hao
- Subjects
Computer Science - Robotics - Abstract
In real-world scenarios, objects often require repositioning and reorientation before they can be grasped, a process known as pre-grasp manipulation. Learning universal dexterous functional pre-grasp manipulation requires precise control over the relative position, orientation, and contact between the hand and object while generalizing to diverse dynamic scenarios with varying objects and goal poses. To address this challenge, we propose a teacher-student learning approach that utilizes a novel mutual reward, incentivizing agents to optimize the three key criteria jointly. Additionally, we introduce a pipeline that employs a mixture-of-experts strategy to learn diverse manipulation policies, followed by a diffusion policy that captures the complex action distributions of these experts. Our method achieves a success rate of 72.6\% across more than 30 object categories by leveraging extrinsic dexterity and adjusting based on feedback.
- Published
- 2024
39. ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
- Author
-
Huang, Siyuan, Ponomarenko, Iaroslav, Jiang, Zhengkai, Li, Xiaoqi, Hu, Xiaobin, Gao, Peng, Li, Hongsheng, and Dong, Hao
- Subjects
Computer Science - Robotics - Abstract
While the integration of Multi-modal Large Language Models (MLLMs) with robotic systems has significantly improved robots' ability to understand and execute natural language instructions, their performance in manipulation tasks remains limited by a lack of robotics-specific knowledge. Conventional MLLMs are typically trained on generic image-text pairs, leaving them deficient in understanding affordances and physical concepts crucial for manipulation. To address this gap, we propose ManipVQA, a novel framework that infuses MLLMs with manipulation-centric knowledge through a Visual Question-Answering (VQA) format (a toy example of such a record follows this entry). This approach encompasses tool detection, affordance recognition, and a broader understanding of physical concepts. We curated a diverse dataset of images depicting interactive objects to challenge robotic understanding in tool detection, affordance prediction, and physical concept comprehension. To effectively integrate this robotics-specific knowledge with the inherent vision-reasoning capabilities of MLLMs, we leverage a unified VQA format and devise a fine-tuning strategy that preserves the original vision-reasoning abilities while incorporating the newly acquired robotic insights. Empirical evaluations conducted in robotic simulators and across various vision task benchmarks demonstrate the robust performance of ManipVQA. The code and dataset are publicly available at https://github.com/SiyuanHuang95/ManipVQA., Comment: Code and dataset are publicly available at https://github.com/SiyuanHuang95/ManipVQA. Accepted by IROS2024
- Published
- 2024
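A minimal sketch of what casting an affordance annotation into a VQA-style record could look like; the field names and question phrasing are assumptions, not ManipVQA's actual data schema.

```python
import json

def make_vqa_sample(image_path: str, tool: str, affordance_box: list) -> dict:
    """Wrap one affordance annotation as an image-question-answer triple.
    (Hypothetical schema for illustration only.)"""
    return {
        "image": image_path,
        "question": f"Which part of the {tool} should the gripper contact to use it?",
        "answer": f"The region with bounding box {affordance_box}.",
    }

sample = make_vqa_sample("images/hammer_001.jpg", "hammer", [120, 40, 260, 90])
print(json.dumps(sample, indent=2))
```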
40. NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation
- Author
-
Xu, Ran, Shen, Yan, Li, Xiaoqi, Wu, Ruihai, and Dong, Hao
- Subjects
Computer Science - Robotics ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic and task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without fine-grained human instructions they become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-horizon task process into steps, each paired with a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines on the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
- Published
- 2024
41. JSTR: Joint Spatio-Temporal Reasoning for Event-based Moving Object Detection
- Author
-
Zhou, Hanyu, Shi, Zhiwei, Dong, Hao, Peng, Shihan, Chang, Yi, and Yan, Luxin
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Event-based moving object detection is a challenging task in which the static background and moving objects are mixed together. Existing methods typically align the background events to the same spatial coordinate system via motion compensation to distinguish the moving object. However, they neglect the potential spatial tailing effect of moving object events caused by excessive motion, which may affect the structural integrity of the extracted moving object. We discover that the moving object has a complete columnar structure in the point cloud composed of motion-compensated events along the timestamp axis. Motivated by this, we propose a novel joint spatio-temporal reasoning method for event-based moving object detection. Specifically, we first compensate the motion of background events using an inertial measurement unit. In the spatial reasoning stage, we project the compensated events into the same image coordinates, discretize the event timestamps to obtain a time image that reflects motion confidence, and further segment the moving object with an adaptive threshold on the time image (a toy version of this step is sketched after this entry). In the temporal reasoning stage, we construct the events into a point cloud along the timestamp axis and use the RANSAC algorithm to extract the columnar shape in the cloud, peeling off the background. Finally, we fuse the results from the two reasoning stages to extract the final moving object region. This joint spatio-temporal reasoning framework can effectively detect the moving object from motion confidence and geometric structure. Moreover, we conduct extensive experiments on various datasets to verify that the proposed method improves moving object detection accuracy by 13\%.
- Published
- 2024
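The spatial-reasoning step above can be approximated in a few lines of NumPy, assuming a mean-plus-standard-deviation threshold; the paper's exact IMU compensation and threshold rule are not reproduced here.

```python
import numpy as np

def time_image_segmentation(events: np.ndarray, height: int, width: int):
    """Build a 'time image' from motion-compensated events and segment
    moving pixels with an adaptive (mean + std) threshold.
    (Illustrative sketch of the general idea, not the paper's exact rule.)

    events: (N, 3) array of (x, y, t) with t normalised to [0, 1].
    Returns (time_img, mask): per-pixel mean timestamp and a boolean mask.
    """
    time_sum = np.zeros((height, width))
    counts = np.zeros((height, width))
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ts = events[:, 2]
    np.add.at(time_sum, (ys, xs), ts)   # accumulate timestamps per pixel
    np.add.at(counts, (ys, xs), 1.0)    # count events per pixel
    time_img = np.divide(time_sum, counts,
                         out=np.zeros_like(time_sum), where=counts > 0)
    active = time_img[counts > 0]
    thresh = active.mean() + active.std()       # adaptive threshold
    mask = (time_img > thresh) & (counts > 0)   # high-confidence moving pixels
    return time_img, mask

# Toy usage: 1000 random events on a 64x64 sensor.
ev = np.column_stack([np.random.randint(0, 64, 1000),
                      np.random.randint(0, 64, 1000),
                      np.random.rand(1000)])
_, moving = time_image_segmentation(ev, 64, 64)
print(moving.sum(), "candidate moving pixels")
```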
42. UniDoorManip: Learning Universal Door Manipulation Policy Over Large-scale and Diverse Door Manipulation Environments
- Author
-
Li, Yu, Zhang, Xiaojie, Wu, Ruihai, Zhang, Zilong, Geng, Yiran, Dong, Hao, and He, Zhaofeng
- Subjects
Computer Science - Robotics - Abstract
Learning a universal manipulation policy that encompasses doors with diverse categories, geometries, and mechanisms is crucial for future embodied agents to work effectively in complex and broad real-world scenarios. Due to limited datasets and unrealistic simulation environments, previous works fail to achieve good performance across various doors. In this work, we build a novel door manipulation environment reflecting different realistic door manipulation mechanisms, and further equip it with a large-scale door dataset covering 6 door categories with hundreds of door bodies and handles, making up thousands of different door instances. Additionally, to better emulate real-world scenarios, we introduce a mobile robot as the agent and use partial and occluded point clouds as the observation, which previous works do not consider despite their significance for real-world deployment. To learn a universal policy over diverse doors, we propose a novel framework that disentangles the whole manipulation process into three stages and integrates them by training in the reverse order of inference. Extensive experiments validate the effectiveness of our designs and demonstrate our framework's strong performance. Code, data, and videos are available at https://unidoormanip.github.io/., Comment: Project page https://unidoormanip.github.io/
- Published
- 2024
43. The cell-type underpinnings of the human functional cortical connectome
- Author
-
Zhang, Xi-Han, Anderson, Kevin M., Dong, Hao-Ming, Chopra, Sidhant, Dhamala, Elvisha, Emani, Prashant S., Gerstein, Mark B., Margulies, Daniel S., and Holmes, Avram J.
- Published
- 2024
- Full Text
- View/download PDF
44. Establishment of a multi-clearance coupled 3D floating nonlinear model and vibration analysis for a coaxial reverse closed differential herringbone gear transmission system
- Author
-
Dong, Hao, Han, Hao, Zhang, Yun-fan, Hou, Xiang-ying, and Jin, Guang-hu
- Published
- 2024
- Full Text
- View/download PDF
45. PCA-Net: a heart segmentation model based on the meta-learning method
- Author
-
Yang, Mengzhu, Zhu, Dong, Dong, Hao, Hu, Shunbo, and Wang, Yongfang
- Published
- 2024
- Full Text
- View/download PDF
46. Specific ECM degradation potentiates the antitumor activity of CAR-T cells in solid tumors
- Author
-
Zheng, Rui, Shen, Kuo, Liang, Sixin, Lyu, Yanhong, Zhang, Siyan, Dong, Hao, Li, Yuanfeng, Han, Yujie, Zhao, Xiaojuan, Zhang, Yiting, Wang, Pengju, Meng, Ruotong, Bai, Shukun, Yang, Jianxun, Lu, Guofang, Li, Jia, Yang, Angang, Zhang, Rui, and Yan, Bo
- Published
- 2024
- Full Text
- View/download PDF
47. Piceatannol Protects Sperm from Cryopreservation Damage by Modulating the Keap1-Nrf2/ARE Signaling Pathway
- Author
-
Fu, Lijie, Wang, Chao, Li, Wenfu, Dong, Hao, Yang, Qian, Chang, Guilin, and Liu, Jianping
- Published
- 2024
- Full Text
- View/download PDF
48. Influence of Mn Addition on the Evolution of Precipitates in Al-Cu Alloys
- Author
-
Dong, Xiongbo, Yang, Sha, Qin, Lan, Wang, XueYi, Li, Na, Zhong, Hao, and Dong, Hao
- Published
- 2024
- Full Text
- View/download PDF
49. Ventral attention network connectivity is linked to cortical maturation and cognitive ability in childhood
- Author
-
Dong, Hao-Ming, Zhang, Xi-Han, Labache, Loïc, Zhang, Shaoshi, Ooi, Leon Qi Rong, Yeo, B. T. Thomas, Margulies, Daniel S., Holmes, Avram J., and Zuo, Xi-Nian
- Published
- 2024
- Full Text
- View/download PDF
50. Precise separation and efficient recovery of Pd(II) from high-level liquid waste by XAD-based adsorbents
- Author
-
Dong, Hao-Ran, Ning, Shun-Yan, Li, Zeng-Yuan, Xu, Si-Zhi, Hu, Feng-Tao, Gao, Feng, Wang, You-Bin, Chen, Li-Feng, Yin, Xiang-Biao, Fujita, Toyohisa, Hamza, Mohammed F., and Wei, Yue-Zhou
- Published
- 2024
- Full Text
- View/download PDF