1,251 results for "Malik, Jitendra"
Search Results
2. Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
- Author
Tian, Ran, Wu, Yilin, Xu, Chenfeng, Tomizuka, Masayoshi, Malik, Jitendra, and Bajcsy, Andrea
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it can learn rewards aligned with human preferences, more efficiently uses preference data, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment., Comment: Submitted to IJRR, this paper is an extended journal version of the conference paper arXiv:2310.07932 with new results and discussion. arXiv admin note: substantial text overlap with arXiv:2310.07932
- Published
- 2024
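The dense reward via feature matching mentioned in the abstract above can be illustrated with a small entropic optimal-transport computation. The sketch below is a toy stand-in under simplifying assumptions (uniform marginals, Euclidean cost, Sinkhorn regularization, and placeholder names like policy_feats and expert_feats); it is not RAPL's implementation, which additionally fine-tunes the vision encoder from human preference feedback.

```python
import numpy as np

def feature_matching_reward(policy_feats, expert_feats, eps=0.1, n_iter=200):
    """Toy dense reward: negative per-timestep transport cost between a rollout's
    visual features and a preferred demonstration's features."""
    # Pairwise cost between rollout timesteps and demonstration timesteps.
    C = np.linalg.norm(policy_feats[:, None, :] - expert_feats[None, :, :], axis=-1)
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-C / (eps * C.mean() + 1e-12))         # Gibbs kernel (entropic regularization)
    u = np.ones(n)
    for _ in range(n_iter):                           # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                   # approximate optimal coupling
    return -(P * C).sum(axis=1) * n                   # one reward per rollout timestep

rng = np.random.default_rng(0)
rollout = rng.normal(size=(20, 32))   # 20 timesteps of visual features from the aligned encoder
demo = rng.normal(size=(15, 32))      # 15 timesteps of preferred-demonstration features
print(feature_matching_reward(rollout, demo).shape)  # (20,)
```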
3. Scaling Properties of Diffusion Models for Perceptual Tasks
- Author
Ravishankar, Rahul, Patel, Zeeshan, Rajasegaran, Jathushan, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .
- Published
- 2024
4. Digitizing Touch with an Artificial Multimodal Fingertip
- Author
Lambeta, Mike, Wu, Tingfan, Sengul, Ali, Most, Victoria Rose, Black, Nolan, Sawyer, Kevin, Mercado, Romeo, Qi, Haozhi, Sohn, Alexander, Taylor, Byron, Tydingco, Norb, Kammerer, Gregg, Stroud, Dave, Khatha, Jake, Jenkins, Kurt, Most, Kyle, Stein, Neal, Chavira, Ricardo, Craven-Bartle, Thomas, Sanchez, Eric, Ding, Yitian, Malik, Jitendra, and Calandra, Roberto
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, I.2.0, I.2.9
- Abstract
Touch is a crucial sensing modality that provides rich information about object properties and interactions with the physical environment. Humans and robots both benefit from using touch to perceive and interact with the surrounding environment (Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017). However, no existing systems provide rich, multi-modal digital touch-sensing capabilities through a hemispherical compliant embodiment. Here, we describe several conceptual and technological innovations to improve the digitization of touch. These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities. Significantly, this fingertip contains high-resolution sensors (~8.3 million taxels) that respond to omnidirectional touch, capture multi-modal signals, and use on-device artificial intelligence to process the data in real time. Evaluations show that the artificial fingertip can resolve spatial features as small as 7 um, sense normal and shear forces with a resolution of 1.01 mN and 1.27 mN, respectively, perceive vibrations up to 10 kHz, sense heat, and even sense odor. Furthermore, it embeds an on-device AI neural network accelerator that acts as a peripheral nervous system on a robot and mimics the reflex arc found in humans. These results demonstrate the possibility of digitizing touch with superhuman performance. The implications are profound, and we anticipate potential applications in robotics (industrial, medical, agricultural, and consumer-level), virtual reality and telepresence, prosthetics, and e-commerce. Toward digitizing touch at scale, we open-source a modular platform to facilitate future research on the nature of touch., Comment: 28 pages
- Published
- 2024
5. Estimating Body and Hand Motion in an Ego-sensed World
- Author
Yi, Brent, Ye, Vickie, Zheng, Maya, Müller, Lea, Pavlakos, Georgios, Ma, Yi, Malik, Jitendra, and Kanazawa, Angjoo
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates. Project page: https://egoallo.github.io/, Comment: v2: fixed figures for Safari, typos
- Published
- 2024
6. Learning Humanoid Locomotion over Challenging Terrain
- Author
Radosavovic, Ilija, Kamat, Sarthak, Darrell, Trevor, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Machine Learning
- Abstract
Humanoid robots can, in principle, use their legs to go almost anywhere. Developing controllers capable of traversing diverse terrains, however, remains a considerable challenge. Classical controllers are hard to generalize broadly while the learning-based methods have primarily focused on gentle terrains. Here, we present a learning-based approach for blind humanoid locomotion capable of traversing challenging natural and man-made terrain. Our method uses a transformer model to predict the next action based on the history of proprioceptive observations and actions. The model is first pre-trained on a dataset of flat-ground trajectories with sequence modeling, and then fine-tuned on uneven terrain using reinforcement learning. We evaluate our model on a real humanoid robot across a variety of terrains, including rough, deformable, and sloped surfaces. The model demonstrates robust performance, in-context adaptation, and emergent terrain representations. In real-world case studies, our humanoid robot successfully traversed over 4 miles of hiking trails in Berkeley and climbed some of the steepest streets in San Francisco., Comment: Project page: https://humanoid-challenging-terrain.github.io
- Published
- 2024
7. A Learning-based Quadcopter Controller with Extreme Adaptation
- Author
Zhang, Dingqi, Loquercio, Antonio, Tang, Jerry, Wang, Ting-Hao, Malik, Jitendra, and Mueller, Mark W.
- Subjects
Computer Science - Robotics
- Abstract
This paper introduces a learning-based low-level controller for quadcopters, which adaptively controls quadcopters with significant variations in mass, size, and actuator capabilities. Our approach leverages a combination of imitation learning and reinforcement learning, creating a fast-adapting and general control framework for quadcopters that eliminates the need for precise model estimation or manual tuning. The controller estimates a latent representation of the vehicle's system parameters from sensor-action history, enabling it to adapt swiftly to diverse dynamics. Extensive evaluations in simulation demonstrate the controller's ability to generalize to unseen quadcopter parameters, with an adaptation range up to 16 times broader than the training set. In real-world tests, the controller is successfully deployed on quadcopters with mass differences of 3.7 times and propeller constants varying by more than 100 times, while also showing rapid adaptation to disturbances such as off-center payloads and motor failures. These results highlight the potential of our controller in extreme adaptation to simplify the design process and enhance the reliability of autonomous drone operations in unpredictable environments. The video and code are at: https://github.com/muellerlab/xadapt_ctrl, Comment: 12 pages, 9 figures
- Published
- 2024
8. Hand-Object Interaction Pretraining from Videos
- Author
Singh, Himanshu Gaurav, Loquercio, Antonio, Sferrazza, Carmelo, Wu, Jane, Qi, Haozhi, Abbeel, Pieter, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
- Abstract
We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/.
- Published
- 2024
9. Synergy and Synchrony in Couple Dances
- Author
Maluleke, Vongani, Müller, Lea, Rajasegaran, Jathushan, Pavlakos, Georgios, Ginosar, Shiry, Kanazawa, Angjoo, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper asks to what extent social interaction influences one's behavior. We study this in the setting of two dancers dancing as a couple. We first consider a baseline in which we predict a dancer's future moves conditioned only on their past motion without regard to their partner. We then investigate the advantage of taking social information into account by conditioning also on the motion of their dancing partner. We focus our analysis on Swing, a dance genre with tight physical coupling for which we present an in-the-wild video dataset. We demonstrate that single-person future motion prediction in this context is challenging. Instead, we observe that prediction greatly benefits from considering the interaction partners' behavior, resulting in surprisingly compelling couple dance synthesis results (see supp. video). Our contributions are a demonstration of the advantages of socially conditioned future motion prediction and an in-the-wild, couple dance video dataset to enable future research in this direction. Video results are available on the project website: https://von31.github.io/synNsync
- Published
- 2024
10. Wolf: Captioning Everything with a World Summarization Framework
- Author
Li, Boyi, Zhu, Ligeng, Tian, Ran, Tan, Shuhan, Chen, Yuxiao, Lu, Yao, Cui, Yin, Veer, Sushant, Ehrlich, Max, Philion, Jonah, Weng, Xinshuo, Xue, Fuzhao, Tao, Andrew, Liu, Ming-Yu, Fidler, Sanja, Ivanovic, Boris, Darrell, Trevor, Malik, Jitendra, Han, Song, and Pavone, Marco
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
- Abstract
We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.
- Published
- 2024
11. Lessons from Learning to Spin 'Pens'
- Author
Wang, Jun, Yuan, Ying, Che, Haichuan, Qi, Haozhi, Ma, Yi, Malik, Jitendra, and Wang, Xiaolong
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
In-hand manipulation of pen-like objects is an important skill in our daily lives, as many tools such as hammers and screwdrivers are similarly shaped. However, current learning-based methods struggle with this task due to a lack of high-quality demonstrations and the significant gap between simulation and the real world. In this work, we push the boundaries of learning-based in-hand manipulation systems by demonstrating the capability to spin pen-like objects. We first use reinforcement learning to train an oracle policy with privileged information and generate a high-fidelity trajectory dataset in simulation. This serves two purposes: 1) pre-training a sensorimotor policy in simulation; 2) conducting open-loop trajectory replay in the real world. We then fine-tune the sensorimotor policy using these real-world trajectories to adapt it to the real world dynamics. With less than 50 trajectories, our policy learns to rotate more than ten pen-like objects with different physical properties for multiple revolutions. We present a comprehensive analysis of our design choices and share the lessons learned during development., Comment: CoRL 2024. Website: https://penspin.github.io/
- Published
- 2024
12. Learning In-Hand Translation Using Tactile Skin With Shear and Normal Force Sensing
- Author
Yin, Jessica, Qi, Haozhi, Malik, Jitendra, Pikul, James, Yim, Mark, and Hellebrekers, Tess
- Subjects
Computer Science - Robotics, Computer Science - Machine Learning
- Abstract
Recent progress in reinforcement learning (RL) and tactile sensing has significantly advanced dexterous manipulation. However, these methods often utilize simplified tactile signals due to the gap between tactile simulation and the real world. We introduce a sensor model for tactile skin that enables zero-shot sim-to-real transfer of ternary shear and binary normal forces. Using this model, we develop an RL policy that leverages sliding contact for dexterous in-hand translation. We conduct extensive real-world experiments to assess how tactile sensing facilitates policy adaptation to various unseen object properties and robot hand orientations. We demonstrate that our 3-axis tactile policies consistently outperform baselines that use only shear forces, only normal forces, or only proprioception. Website: https://jessicayin.github.io/tactile-skin-rl/, Comment: Website: https://jessicayin.github.io/tactile-skin-rl/
- Published
- 2024
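A minimal sketch of the kind of sensor model described above, which maps continuous contact forces to the coarse signals used for zero-shot sim-to-real transfer: ternary shear per axis and a binary normal-force contact bit. The thresholds and units here are made-up placeholders, not the paper's calibrated values.

```python
import numpy as np

def discretize_taxel(shear_xy, normal_z, shear_thresh=0.05, normal_thresh=0.2):
    """Continuous forces -> ternary shear in {-1, 0, +1} per axis and binary contact."""
    shear = np.where(np.abs(shear_xy) < shear_thresh, 0, np.sign(shear_xy)).astype(int)
    contact = int(normal_z > normal_thresh)
    return shear, contact

print(discretize_taxel(np.array([0.3, -0.01]), 0.5))  # (array([1, 0]), 1)
```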
13. Discrete Legendre Projection Methods for the Fredholm Integral Equations of the Second Kind
- Author
Malik, Jitendra Kumar and Panigrahi, Bijaya Laxmi
- Published
- 2019
14. Learning Visuotactile Skills with Two Multifingered Hands
- Author
Lin, Toru, Zhang, Yu, Li, Qiyang, Qi, Haozhi, Yi, Brent, Levine, Sergey, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found at https://toruowo.github.io/hato/ ., Comment: Code and Project Website: https://toruowo.github.io/hato/
- Published
- 2024
15. Reconstructing Hand-Held Objects in 3D from Images and Videos
- Author
Wu, Jane, Pavlakos, Georgios, Gkioxari, Georgia, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho, Comment: Project page: https://janehwu.github.io/mcc-ho
- Published
- 2024
16. Deep learning for hate speech detection: a comparative study
- Author
Malik, Jitendra Singh, Qiao, Hezhe, Pang, Guansong, and van den Hengel, Anton
- Published
- 2024
17. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
- Author
Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, Fagan, Peter David, Hejna, Joey, Itkina, Masha, Lepert, Marion, Ma, Yecheng Jason, Miller, Patrick Tree, Wu, Jimmy, Belkhale, Suneel, Dass, Shivin, Ha, Huy, Jain, Arhan, Lee, Abraham, Lee, Youngwoon, Memmel, Marius, Park, Sungjae, Radosavovic, Ilija, Wang, Kaiyuan, Zhan, Albert, Black, Kevin, Chi, Cheng, Hatch, Kyle Beltran, Lin, Shan, Lu, Jingpei, Mercat, Jean, Rehman, Abdul, Sanketi, Pannag R, Sharma, Archit, Simpson, Cody, Vuong, Quan, Walke, Homer Rich, Wulfe, Blake, Xiao, Ted, Yang, Jonathan Heewon, Yavary, Arefeh, Zhao, Tony Z., Agia, Christopher, Baijal, Rohan, Castro, Mateo Guaman, Chen, Daphne, Chen, Qiuyu, Chung, Trinity, Drake, Jaimyn, Foster, Ethan Paul, Gao, Jensen, Herrera, David Antonio, Heo, Minho, Hsu, Kyle, Hu, Jiaheng, Jackson, Donovon, Le, Charlotte, Li, Yunshuang, Lin, Kevin, Lin, Roy, Ma, Zehan, Maddukuri, Abhiram, Mirchandani, Suvir, Morton, Daniel, Nguyen, Tony, O'Neill, Abigail, Scalise, Rosario, Seale, Derick, Son, Victor, Tian, Stephen, Tran, Emi, Wang, Andrew E., Wu, Yilin, Xie, Annie, Yang, Jingyun, Yin, Patrick, Zhang, Yunchu, Bastani, Osbert, Berseth, Glen, Bohg, Jeannette, Goldberg, Ken, Gupta, Abhinav, Gupta, Abhishek, Jayaraman, Dinesh, Lim, Joseph J, Malik, Jitendra, Martín-Martín, Roberto, Ramamoorthy, Subramanian, Sadigh, Dorsa, Song, Shuran, Wu, Jiajun, Yip, Michael C., Zhu, Yuke, Kollar, Thomas, Levine, Sergey, and Finn, Chelsea
- Subjects
Computer Science - Robotics
- Abstract
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup., Comment: Project website: https://droid-dataset.github.io/
- Published
- 2024
18. AutoEval Done Right: Using Synthetic Data for Model Evaluation
- Author
Boyeau, Pierre, Angelopoulos, Anastasios N., Yosef, Nir, Malik, Jitendra, and Jordan, Michael I.
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Methodology
- Abstract
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4., Comment: New experiments, fix fig 1
- Published
- 2024
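The core idea above, stretching a small human-labeled set with a large pool of AI-generated labels while staying unbiased, can be sketched with a generic bias-corrected (prediction-powered-style) mean estimator. This is an illustration under the assumption that the metric is a simple mean; variable names are placeholders, and the paper's algorithms are more general.

```python
import numpy as np

def autoeval_mean(ai_scores_unlabeled, ai_scores_labeled, human_scores_labeled):
    """Bias-corrected estimate of a model's mean score.

    Combines cheap AI scores on a large unlabeled pool with a small subset that also
    has human scores; the correction term removes systematic AI-labeling bias."""
    naive = np.mean(ai_scores_unlabeled)                                  # biased if AI labels are off
    correction = np.mean(np.asarray(human_scores_labeled, float)
                         - np.asarray(ai_scores_labeled, float))
    return naive + correction

# Toy usage: 10,000 AI-scored outputs plus 200 that are also human-scored.
rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.7, size=10_200).astype(float)
ai = np.clip(truth + rng.normal(0.1, 0.2, size=truth.size), 0, 1)        # slightly optimistic AI judge
print(round(autoeval_mean(ai[200:], ai[:200], truth[:200]), 3))
```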
19. Twisting Lids Off with Two Hands
- Author
Lin, Toru, Yin, Zhao-Heng, Qi, Haozhi, Abbeel, Pieter, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, due to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we share novel insights into physical modeling, real-time perception, and reward design that enable policies trained in simulation using deep reinforcement learning (RL) to be effectively and efficiently transferred to the real world. Specifically, we consider the problem of twisting lids of various bottle-like objects with two hands, demonstrating policies with generalization capabilities across a diverse set of unseen objects as well as dynamic and dexterous behaviors. To the best of our knowledge, this is the first sim-to-real RL system that enables such capabilities on bimanual multi-fingered hands., Comment: Project page can be found at https://toruowo.github.io/bimanual-twist
- Published
- 2024
20. xT: Nested Tokenization for Larger Context in Large Images
- Author
Gupta, Ritwik, Li, Shufan, Zhu, Tyler, Malik, Jitendra, Darrell, Trevor, and Mangalam, Karttikeya
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce xT, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales and assess our method's improvement on them. xT is a streaming, two-stage architecture that adapts existing vision backbones and long sequence language models to effectively model large images without quadratic memory growth. We are able to increase accuracy by up to 8.6% on challenging classification tasks and $F_1$ score by 11.6 on context-dependent segmentation on images as large as 29,000 x 29,000 pixels., Comment: Accepted to the 2024 International Conference on Machine Learning (ICML)
- Published
- 2024
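A minimal sketch of the nested, two-level tokenization implied above: a large image is first split into coarse regions (which can be encoded independently and streamed) and each region into patches. It assumes a square image whose side is divisible by the region and patch sizes; the actual xT pipeline adds backbone encoders and a long-sequence model over region features.

```python
import numpy as np

def nested_tokenize(image, region=256, patch=16):
    """Split image -> regions -> patches, returning both levels of tokens."""
    H, W, C = image.shape
    regions = (image
               .reshape(H // region, region, W // region, region, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, region, region, C))
    patches = (regions
               .reshape(regions.shape[0], region // patch, patch, region // patch, patch, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(regions.shape[0], -1, patch * patch * C))
    return regions, patches

img = np.zeros((1024, 1024, 3), dtype=np.float32)
regions, patches = nested_tokenize(img)
print(regions.shape, patches.shape)  # (16, 256, 256, 3) (16, 256, 768)
```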
21. Humanoid Locomotion as Next Token Prediction
- Author
Radosavovic, Ilija, Zhang, Bike, Shi, Baifeng, Rajasegaran, Jathushan, Kamat, Sarthak, Darrell, Trevor, Sreenath, Koushil, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
- Published
- 2024
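A toy illustration of the modality-aligned prediction described above: observation and action tokens are interleaved, each token's target is the next token of the same modality, and positions with missing actions (e.g. trajectories mined from video) are simply masked out of the loss rather than given fabricated targets. The tokenization, transformer, and training loop are omitted, and the exact interleaving in the paper may differ.

```python
import numpy as np

def modality_aligned_targets(obs, act, act_available):
    """Build interleaved tokens, same-modality next-token targets, and a loss mask."""
    tokens, targets, mask = [], [], []
    for t in range(len(obs) - 1):
        tokens.append(("obs", obs[t]))
        targets.append(("obs", obs[t + 1]))
        mask.append(1.0)
        tokens.append(("act", act[t]))
        targets.append(("act", act[t + 1]))
        mask.append(1.0 if act_available[t] and act_available[t + 1] else 0.0)
    return tokens, targets, np.array(mask)

obs = [0, 1, 2, 3]            # placeholder observation tokens
act = [10, 11, None, 13]      # action missing at step 2 (video-only segment)
tokens, targets, mask = modality_aligned_targets(obs, act, [a is not None for a in act])
print(mask)  # [1. 1. 1. 0. 1. 0.] -- action losses masked wherever actions are missing
```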
22. Synthesizing Moving People with 3D Control
- Author
Li, Boyi, Rajasegaran, Jathushan, Gandelsman, Yossi, Efros, Alexei A., and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and to the input image in terms of visual similarity. In addition, the 3D control allows the person to be rendered from various synthetic camera trajectories. Our experiments show that our method is more resilient than prior methods in generating prolonged motions and varied, challenging, and complex poses. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.
- Published
- 2024
23. Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- Author
Zhao, Chen, Liu, Shuming, Mangalam, Karttikeya, Qian, Guocheng, Zohra, Fatimah, Alghannam, Abdulmohsen, Malik, Jitendra, and Ghanem, Bernard
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly memory-intensive for tasks with high-resolution data, e.g., video understanding, small object detection, and point cloud analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr$^2$Net, a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. Dr$^2$Net contains two types of residual connections, one maintaining the residual structure in the pretrained models, and the other making the network reversible. Due to its reversibility, intermediate activations, which can be reconstructed from the output, are cleared from memory during training. We apply a separate coefficient to each type of residual connection and introduce a dynamic training strategy that seamlessly transitions the pretrained model to a reversible network with much higher numerical precision. We evaluate Dr$^2$Net on various pretrained models and various tasks, and show that it can reach comparable performance to conventional finetuning but with significantly less memory usage.
- Published
- 2024
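Reversibility is what allows intermediate activations to be recomputed rather than stored. The coupled-block sketch below, with scalar coefficients and toy residual functions, only illustrates that mechanism; the actual Dr$^2$Net coupling, coefficient schedule, and sub-networks differ.

```python
import numpy as np

def reversible_block(x1, x2, f, g, alpha=1.0, beta=1.0):
    """Coupled residual block whose inputs can be recomputed from its outputs."""
    y1 = alpha * x1 + beta * f(x2)
    y2 = alpha * x2 + beta * g(y1)
    return y1, y2

def invert_block(y1, y2, f, g, alpha=1.0, beta=1.0):
    """Recover the inputs exactly, so activations need not be kept in memory."""
    x2 = (y2 - beta * g(y1)) / alpha
    x1 = (y1 - beta * f(x2)) / alpha
    return x1, x2

f = lambda t: np.tanh(t)          # stand-ins for pretrained sub-networks
g = lambda t: 0.5 * t ** 2

x1, x2 = np.random.default_rng(2).normal(size=(2, 8))
y1, y2 = reversible_block(x1, x2, f, g, alpha=0.9, beta=0.4)
r1, r2 = invert_block(y1, y2, f, g, alpha=0.9, beta=0.4)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```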
24. Adaptive Human Trajectory Prediction via Latent Corridors
- Author
Thakkar, Neerja, Mangalam, Karttikeya, Bajcsy, Andrea, Malik, Jitendra, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
25. Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation
- Author
Suresh, Sudharshan, Qi, Haozhi, Wu, Tingfan, Fan, Taosha, Pineda, Luis, Lambeta, Mike, Malik, Jitendra, Kalakrishnan, Mrinal, Calandra, Roberto, Kaess, Michael, Ortiz, Joseph, and Mukadam, Mustafa
- Subjects
Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object's pose and shape. The status quo for in-hand perception primarily employs vision and is restricted to tracking a priori known objects. Moreover, visual occlusion of objects in-hand is inevitable during manipulation, preventing current systems from pushing beyond tasks without occlusion. We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation. Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem. We study multimodal in-hand perception in simulation and the real world, interacting with different objects via a proprioception-driven policy. Our experiments show final reconstruction F-scores of $81$% and average pose drifts of $4.7\,\text{mm}$, further reduced to $2.3\,\text{mm}$ with known CAD models. Additionally, we observe that under heavy visual occlusion we can achieve up to $94$% improvements in tracking compared to vision-only methods. Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation. We release our evaluation dataset of 70 experiments, FeelSight, as a step towards benchmarking in this domain. Our neural representation driven by multimodal sensing can serve as a perception backbone towards advancing robot dexterity. Videos can be found on our project website https://suddhu.github.io/neural-feels/, Comment: 43 pages, 20 figures, 1 table; https://suddhu.github.io/neural-feels/
- Published
- 2023
26. Adaptive Human Trajectory Prediction via Latent Corridors
- Author
Thakkar, Neerja, Mangalam, Karttikeya, Bajcsy, Andrea, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Human trajectory prediction is typically posed as a zero-shot generalization problem: a predictor is learnt on a dataset of human motion in training scenes, and then deployed on unseen test scenes. While this paradigm has yielded tremendous progress, it fundamentally assumes that trends in human behavior within the deployment scene are constant over time. As such, current prediction models are unable to adapt to scene-specific transient human behaviors, such as crowds temporarily gathering to see buskers, pedestrians hurrying through the rain and avoiding puddles, or a protest breaking out. We formalize the problem of scene-specific adaptive trajectory prediction and propose a new adaptation approach inspired by prompt tuning called latent corridors. By augmenting the input of any pre-trained human trajectory predictor with learnable image prompts, the predictor can improve in the deployment scene by inferring trends from extremely small amounts of new data (e.g., 2 humans observed for 30 seconds). With less than 0.1% additional model parameters, we see up to 23.9% ADE improvement in MOTSynth simulated data and 16.4% ADE in MOT and Wildtrack real pedestrian data. Qualitatively, we observe that latent corridors imbue predictors with an awareness of scene geometry and scene-specific human behaviors that non-adaptive predictors struggle to capture. The project website can be found at https://neerja.me/atp_latent_corridors/., Comment: Accepted to ECCV 2024. Project website can be found at https://neerja.me/atp_latent_corridors/
- Published
- 2023
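A minimal numpy sketch of the adaptation mechanism described above: a frozen predictor whose input is augmented with a small learnable prompt, and only the prompt is updated from a handful of deployment-scene observations. The linear predictor, shapes, and learning rate are toy assumptions; the paper prompts image-conditioned trajectory predictors.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(2, 6))      # frozen pretrained predictor weights (never updated)
prompt = np.zeros(2)             # learnable scene-specific prompt: the only adapted parameters

def predict(history, prompt):
    """Frozen predictor applied to the past trajectory augmented with the prompt."""
    return W @ np.concatenate([history, prompt])

histories = rng.normal(size=(20, 4))     # a few observed trajectories in the new scene
futures = histories[:, :2] + 0.5         # scene-specific drift the base predictor doesn't model
lr = 0.1
for _ in range(200):
    grad = np.zeros_like(prompt)
    for h, y in zip(histories, futures):
        err = predict(h, prompt) - y
        grad += W[:, 4:].T @ err / len(histories)   # gradient w.r.t. the prompt only
    prompt -= lr * grad
print(np.round(prompt, 2))       # the prompt has absorbed the scene-specific bias
```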
27. Reconstructing Hands in 3D with Transformers
- Author
Pavlakos, Georgios, Shan, Dandan, Radosavovic, Ilija, Kanazawa, Angjoo, Fouhey, David, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/.
- Published
- 2023
28. Sequential Modeling Enables Scalable Learning for Large Vision Models
- Author
Bai, Yutong, Geng, Xinyang, Mangalam, Karttikeya, Bar, Amir, Yuille, Alan, Darrell, Trevor, Malik, Jitendra, and Efros, Alexei A
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time., Comment: Website: https://yutongbai.com/lvm.html
- Published
- 2023
29. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
- Author
Grauman, Kristen, Westbury, Andrew, Torresani, Lorenzo, Kitani, Kris, Malik, Jitendra, Afouras, Triantafyllos, Ashutosh, Kumar, Baiyya, Vijay, Bansal, Siddhant, Boote, Bikram, Byrne, Eugene, Chavis, Zach, Chen, Joya, Cheng, Feng, Chu, Fu-Jen, Crane, Sean, Dasgupta, Avijit, Dong, Jing, Escobar, Maria, Forigua, Cristhian, Gebreselasie, Abrham, Haresh, Sanjay, Huang, Jing, Islam, Md Mohaiminul, Jain, Suyog, Khirodkar, Rawal, Kukreja, Devansh, Liang, Kevin J, Liu, Jia-Wei, Majumder, Sagnik, Mao, Yongsen, Martin, Miguel, Mavroudi, Effrosyni, Nagarajan, Tushar, Ragusa, Francesco, Ramakrishnan, Santhosh Kumar, Seminara, Luigi, Somayazulu, Arjun, Song, Yale, Su, Shan, Xue, Zihui, Zhang, Edward, Zhang, Jinxu, Castillo, Angela, Chen, Changan, Fu, Xinzhu, Furuta, Ryosuke, Gonzalez, Cristina, Gupta, Prince, Hu, Jiabo, Huang, Yifei, Huang, Yiming, Khoo, Weslie, Kumar, Anush, Kuo, Robert, Lakhavani, Sach, Liu, Miao, Luo, Mi, Luo, Zhengyi, Meredith, Brighid, Miller, Austin, Oguntola, Oluwatumininu, Pan, Xiaqing, Peng, Penny, Pramanick, Shraman, Ramazanova, Merey, Ryan, Fiona, Shan, Wei, Somasundaram, Kiran, Song, Chenan, Southerland, Audrey, Tateno, Masatoshi, Wang, Huiyu, Wang, Yuchen, Yagi, Takuma, Yan, Mingfei, Yang, Xitong, Yu, Zecheng, Zha, Shengxin Cindy, Zhao, Chen, Zhao, Ziwei, Zhu, Zhifan, Zhuo, Jeff, Arbelaez, Pablo, Bertasius, Gedas, Crandall, David, Damen, Dima, Engel, Jakob, Farinella, Giovanni Maria, Furnari, Antonino, Ghanem, Bernard, Hoffman, Judy, Jawahar, C. V., Newcombe, Richard, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Savva, Manolis, Shi, Jianbo, Shou, Mike Zheng, and Wray, Michael
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/, Comment: Expanded manuscript (compared to arxiv v1 from Nov 2023 and CVPR 2024 paper from June 2024) for more comprehensive dataset and benchmark presentation, plus new results on v2 data release
- Published
- 2023
30. GOAT: GO to Any Thing
- Author
Chang, Matthew, Gervet, Theophile, Khanna, Mukul, Yenamandra, Sriram, Shah, Dhruv, Min, So Yeon, Shah, Kavit, Paxton, Chris, Gupta, Saurabh, Batra, Dhruv, Mottaghi, Roozbeh, Malik, Jitendra, and Chaplot, Devendra Singh
- Subjects
Computer Science - Robotics
- Abstract
In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category to enable navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes consisting of 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation.
- Published
- 2023
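A minimal sketch of an instance-aware semantic memory of the kind described above: detections are stored per category together with an appearance feature and a map location, so a category goal can match any instance while an image or language goal matches the most similar instance. The class and method names are illustrative, not GOAT's API.

```python
import numpy as np

class InstanceMemory:
    """Category -> list of (appearance feature, map location) for seen object instances."""

    def __init__(self):
        self.instances = {}

    def add(self, category, feature, location):
        self.instances.setdefault(category, []).append((np.asarray(feature, float), location))

    def lookup(self, category, goal_feature=None):
        """Category goal: any stored instance. Image/language goal: closest by cosine similarity."""
        candidates = self.instances.get(category, [])
        if not candidates:
            return None                          # unseen category -> fall back to exploration
        if goal_feature is None:
            return candidates[0][1]
        g = np.asarray(goal_feature, float)
        sims = [f @ g / (np.linalg.norm(f) * np.linalg.norm(g) + 1e-8) for f, _ in candidates]
        return candidates[int(np.argmax(sims))][1]

mem = InstanceMemory()
mem.add("chair", [1.0, 0.0], location=(2, 3))
mem.add("chair", [0.0, 1.0], location=(8, 1))
print(mem.lookup("chair", goal_feature=[0.1, 0.9]))  # (8, 1)
```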
31. Conformal Policy Learning for Sensorimotor Control Under Distribution Shifts
- Author
Huang, Huang, Sharma, Satvik, Loquercio, Antonio, Angelopoulos, Anastasios, Goldberg, Ken, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence
- Abstract
This paper focuses on the problem of detecting and reacting to changes in the distribution of a sensorimotor controller's observables. The key idea is the design of switching policies that can take conformal quantiles as input, which we define as conformal policy learning, that allows robots to detect distribution shifts with formal statistical guarantees. We show how to design such policies by using conformal quantiles to switch between base policies with different characteristics, e.g. safety or speed, or directly augmenting a policy observation with a quantile and training it with reinforcement learning. Theoretically, we show that such policies achieve the formal convergence guarantees in finite time. In addition, we thoroughly evaluate their advantages and limitations on two compelling use cases: simulated autonomous driving and active perception with a physical quadruped. Empirical results demonstrate that our approach outperforms five baselines. It is also the simplest of the baseline strategies besides one ablation. Being easy to use, flexible, and with formal guarantees, our work demonstrates how conformal prediction can be an effective tool for sensorimotor learning under uncertainty., Comment: Conformal Policy Learning
- Published
- 2023
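The switching idea above can be sketched in two steps: compute a finite-sample-valid conformal quantile of nonconformity scores on calibration data, then fall back to the cautious base policy whenever the online score exceeds it. This is a generic split-conformal sketch with placeholder scores and actions, not the paper's implementation.

```python
import numpy as np

def conformal_quantile(calibration_scores, alpha=0.1):
    """(1 - alpha) conformal quantile of nonconformity scores."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))      # conformal rank
    return scores[min(k, n) - 1]

def switching_policy(score, threshold, nominal_action, safe_action):
    """Switch to the cautious base policy when the score signals a distribution shift."""
    return safe_action if score > threshold else nominal_action

# Usage: calibrate on held-out prediction errors, then switch online.
calib_errors = np.abs(np.random.default_rng(1).normal(0.0, 1.0, size=500))
tau = conformal_quantile(calib_errors, alpha=0.1)
print(switching_policy(score=2.5, threshold=tau, nominal_action="fast", safe_action="slow"))
```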
32. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
- Author
Puig, Xavier, Undersander, Eric, Szot, Andrew, Cote, Mikael Dallaire, Yang, Tsung-Yen, Partsey, Ruslan, Desai, Ruta, Clegg, Alexander William, Hlavac, Michal, Min, So Yeon, Vondruš, Vladimír, Gervet, Theophile, Berges, Vincent-Pierre, Turner, John M., Maksymets, Oleksandr, Kira, Zsolt, Kalakrishnan, Mrinal, Malik, Jitendra, Chaplot, Devendra Singh, Jain, Unnat, Batra, Dhruv, Rai, Akshara, and Mottaghi, Roozbeh
- Subjects
Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Multiagent Systems, Computer Science - Robotics
- Abstract
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in-depth, as well as evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners that might exhibit behaviors that the robot has not seen before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the effective completion of the task by the humanoid agent. Furthermore, our experiments using the human-in-the-loop tool demonstrate that our automated evaluation with humanoids can provide an indication of the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities., Comment: Project page: http://aihabitat.org/habitat3
- Published
- 2023
33. Interactive Task Planning with Language Models
- Author
Li, Boyi, Wu, Philipp, Abbeel, Pieter, and Malik, Jitendra
- Subjects
Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
- Abstract
An interactive robot framework accomplishes long-horizon task planning and can easily generalize to new goals or distinct tasks, even during execution. However, most traditional methods require predefined module design, which makes it hard to generalize to different goals. Recent large language model based approaches can allow for more open-ended planning but often require heavy prompt engineering or domain-specific pretrained models. To tackle this, we propose a simple framework that achieves interactive task planning with language models. Our system incorporates both high-level planning and low-level function execution via language. We verify the robustness of our system in generating novel high-level instructions for unseen objectives and its ease of adaptation to different tasks by merely substituting the task guidelines, without the need for additional complex prompt engineering. Furthermore, when the user sends a new request, our system is able to replan accordingly with precision based on the new request, task guidelines and previously executed steps. Please check more details on our https://wuphilipp.github.io/itp_site and https://youtu.be/TrKLuyv26_g.
- Published
- 2023
34. Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- Author
Collaboration, Open X-Embodiment, O'Neill, Abby, Rehman, Abdul, Gupta, Abhinav, Maddukuri, Abhiram, Gupta, Abhishek, Padalkar, Abhishek, Lee, Abraham, Pooley, Acorn, Gupta, Agrim, Mandlekar, Ajay, Jain, Ajinkya, Tung, Albert, Bewley, Alex, Herzog, Alex, Irpan, Alex, Khazatsky, Alexander, Rai, Anant, Gupta, Anchit, Wang, Andrew, Kolobov, Andrey, Singh, Anikait, Garg, Animesh, Kembhavi, Aniruddha, Xie, Annie, Brohan, Anthony, Raffin, Antonin, Sharma, Archit, Yavary, Arefeh, Jain, Arhan, Balakrishna, Ashwin, Wahid, Ayzaan, Burgess-Limerick, Ben, Kim, Beomjoon, Schölkopf, Bernhard, Wulfe, Blake, Ichter, Brian, Lu, Cewu, Xu, Charles, Le, Charlotte, Finn, Chelsea, Wang, Chen, Xu, Chenfeng, Chi, Cheng, Huang, Chenguang, Chan, Christine, Agia, Christopher, Pan, Chuer, Fu, Chuyuan, Devin, Coline, Xu, Danfei, Morton, Daniel, Driess, Danny, Chen, Daphne, Pathak, Deepak, Shah, Dhruv, Büchler, Dieter, Jayaraman, Dinesh, Kalashnikov, Dmitry, Sadigh, Dorsa, Johns, Edward, Foster, Ethan, Liu, Fangchen, Ceola, Federico, Xia, Fei, Zhao, Feiyu, Frujeri, Felipe Vieira, Stulp, Freek, Zhou, Gaoyue, Sukhatme, Gaurav S., Salhotra, Gautam, Yan, Ge, Feng, Gilbert, Schiavi, Giulio, Berseth, Glen, Kahn, Gregory, Yang, Guangwen, Wang, Guanzhi, Su, Hao, Fang, Hao-Shu, Shi, Haochen, Bao, Henghui, Amor, Heni Ben, Christensen, Henrik I, Furuta, Hiroki, Bharadhwaj, Homanga, Walke, Homer, Fang, Hongjie, Ha, Huy, Mordatch, Igor, Radosavovic, Ilija, Leal, Isabel, Liang, Jacky, Abou-Chakra, Jad, Kim, Jaehyung, Drake, Jaimyn, Peters, Jan, Schneider, Jan, Hsu, Jasmine, Vakil, Jay, Bohg, Jeannette, Bingham, Jeffrey, Wu, Jeffrey, Gao, Jensen, Hu, Jiaheng, Wu, Jiajun, Wu, Jialin, Sun, Jiankai, Luo, Jianlan, Gu, Jiayuan, Tan, Jie, Oh, Jihoon, Wu, Jimmy, Lu, Jingpei, Yang, Jingyun, Malik, Jitendra, Silvério, João, Hejna, Joey, Booher, Jonathan, Tompson, Jonathan, Yang, Jonathan, Salvador, Jordi, Lim, Joseph J., Han, Junhyek, Wang, Kaiyuan, Rao, Kanishka, Pertsch, Karl, Hausman, Karol, Go, Keegan, Gopalakrishnan, Keerthana, Goldberg, Ken, Byrne, Kendra, Oslund, Kenneth, Kawaharazuka, Kento, Black, Kevin, Lin, Kevin, Zhang, Kevin, Ehsani, Kiana, Lekkala, Kiran, Ellis, Kirsty, Rana, Krishan, Srinivasan, Krishnan, Fang, Kuan, Singh, Kunal Pratap, Zeng, Kuo-Hao, Hatch, Kyle, Hsu, Kyle, Itti, Laurent, Chen, Lawrence Yunliang, Pinto, Lerrel, Fei-Fei, Li, Tan, Liam, Fan, Linxi "Jim", Ott, Lionel, Lee, Lisa, Weihs, Luca, Chen, Magnum, Lepert, Marion, Memmel, Marius, Tomizuka, Masayoshi, Itkina, Masha, Castro, Mateo Guaman, Spero, Max, Du, Maximilian, Ahn, Michael, Yip, Michael C., Zhang, Mingtong, Ding, Mingyu, Heo, Minho, Srirama, Mohan Kumar, Sharma, Mohit, Kim, Moo Jin, Kanazawa, Naoaki, Hansen, Nicklas, Heess, Nicolas, Joshi, Nikhil J, Suenderhauf, Niko, Liu, Ning, Di Palo, Norman, Shafiullah, Nur Muhammad Mahi, Mees, Oier, Kroemer, Oliver, Bastani, Osbert, Sanketi, Pannag R, Miller, Patrick "Tree", Yin, Patrick, Wohlhart, Paul, Xu, Peng, Fagan, Peter David, Mitrano, Peter, Sermanet, Pierre, Abbeel, Pieter, Sundaresan, Priya, Chen, Qiuyu, Vuong, Quan, Rafailov, Rafael, Tian, Ran, Doshi, Ria, Mart'in-Mart'in, Roberto, Baijal, Rohan, Scalise, Rosario, Hendrix, Rose, Lin, Roy, Qian, Runjia, Zhang, Ruohan, Mendonca, Russell, Shah, Rutav, Hoque, Ryan, Julian, Ryan, Bustamante, Samuel, Kirmani, Sean, Levine, Sergey, Lin, Shan, Moore, Sherry, Bahl, Shikhar, Dass, Shivin, Sonawani, Shubham, Tulsiani, Shubham, Song, Shuran, Xu, Sichun, Haldar, Siddhant, Karamcheti, Siddharth, Adebola, Simeon, Guist, Simon, Nasiriany, Soroush, Schaal, Stefan, 
Welker, Stefan, Tian, Stephen, Ramamoorthy, Subramanian, Dasari, Sudeep, Belkhale, Suneel, Park, Sungjae, Nair, Suraj, Mirchandani, Suvir, Osa, Takayuki, Gupta, Tanmay, Harada, Tatsuya, Matsushima, Tatsuya, Xiao, Ted, Kollar, Thomas, Yu, Tianhe, Ding, Tianli, Davchev, Todor, Zhao, Tony Z., Armstrong, Travis, Darrell, Trevor, Chung, Trinity, Jain, Vidhi, Kumar, Vikash, Vanhoucke, Vincent, Zhan, Wei, Zhou, Wenxuan, Burgard, Wolfram, Chen, Xi, Chen, Xiangyu, Wang, Xiaolong, Zhu, Xinghao, Geng, Xinyang, Liu, Xiyuan, Liangwei, Xu, Li, Xuanlin, Pang, Yansong, Lu, Yao, Ma, Yecheng Jason, Kim, Yejin, Chebotar, Yevgen, Zhou, Yifan, Zhu, Yifeng, Wu, Yilin, Xu, Ying, Wang, Yixuan, Bisk, Yonatan, Dou, Yongqiang, Cho, Yoonyoung, Lee, Youngwoon, Cui, Yuchen, Cao, Yue, Wu, Yueh-Hua, Tang, Yujin, Zhu, Yuke, Zhang, Yunchu, Jiang, Yunfan, Li, Yunshuang, Li, Yunzhu, Iwasawa, Yusuke, Matsuo, Yutaka, Ma, Zehan, Xu, Zhuo, Cui, Zichen Jeff, Zhang, Zichen, Fu, Zipeng, and Lin, Zipeng
- Subjects
Computer Science - Robotics
- Abstract
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io., Comment: Project website: https://robotics-transformer-x.github.io
- Published
- 2023
35. What Matters to You? Towards Visual Representation Alignment for Robot Learning
- Author
Tian, Ran, Xu, Chenfeng, Tomizuka, Masayoshi, Malik, Jitendra, and Bajcsy, Andrea
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently, there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done via proxy tasks like dynamics prediction or enforcing temporal cycle-consistency. However, all these proxy tasks bypass the human's input on what matters to them, exacerbating spurious correlations and ultimately leading to robot behaviors that are misaligned with user preferences. In this work, we propose that robots should leverage human feedback to align their visual representations with the end-user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem and the visual reward learning problem through the lens of preference-based learning and optimal transport. Across experiments in X-MAGICAL and in robotic manipulation, we find that RAPL's reward consistently generates preferred robot behaviors with high sample efficiency, and shows strong zero-shot generalization when the visual representation is learned from a different embodiment than the robot's.
- Published
- 2023
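The entry above combines two ingredients: a preference loss that fine-tunes a vision encoder, and a dense reward built by optimal-transport feature matching. The sketch below illustrates one plausible reading of that recipe; the Bradley-Terry-style loss, the cosine cost, and all variable names are illustrative assumptions rather than the released implementation.

```python
# Hedged sketch: (1) an OT-based score of how closely a trajectory's features match a
# preferred reference trajectory, and (2) a preference loss that fine-tunes the encoder
# so these scores agree with human preference labels. Illustrative only.
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport plan with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    mu = cost.new_full((n,), 1.0 / n)
    nu = cost.new_full((m,), 1.0 / m)
    a = torch.ones_like(mu)
    for _ in range(n_iters):
        b = nu / (K.T @ a)
        a = mu / (K @ b)
    b = nu / (K.T @ a)
    return a[:, None] * K * b[None, :]

def ot_score(encoder, traj_obs, ref_obs):
    """Negative OT cost between trajectory features and reference features (higher = closer)."""
    z = F.normalize(encoder(traj_obs), dim=-1)       # assumes encoder returns (T, d)
    z_ref = F.normalize(encoder(ref_obs), dim=-1)    # (T', d)
    cost = 1.0 - z @ z_ref.T                         # cosine cost matrix
    plan = sinkhorn(cost)
    return -(plan * cost).sum()

def preference_loss(encoder, traj_a, traj_b, ref_obs, prefer_a: bool):
    """Bradley-Terry loss on the OT scores of two compared trajectories."""
    logits = torch.stack([ot_score(encoder, traj_a, ref_obs),
                          ot_score(encoder, traj_b, ref_obs)])
    target = torch.tensor(0 if prefer_a else 1)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
```

After fine-tuning, the same `ot_score` can be reused as a dense visual reward for policy optimization, which is the sense in which one representation-alignment step serves both problems.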
36. Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions
- Author
-
Lekeufack, Jordan, Angelopoulos, Anastasios N., Bajcsy, Andrea, Jordan, Michael I., and Malik, Jitendra
- Subjects
Statistics - Machine Learning ,Computer Science - Machine Learning ,Computer Science - Robotics ,Statistics - Methodology - Abstract
We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing to exhibit high throughput and low error, to the choice of trusting a nominal policy versus switching to a safe backup policy at run-time. The decisions produced by our algorithms are safe in the sense that they come with provable statistical guarantees of having low risk without any assumptions on the world model whatsoever; the observations need not be I.I.D. and can even be adversarial. The theory extends results from conformal prediction to calibrate decisions directly, without requiring the construction of prediction sets. Experiments demonstrate the utility of our approach in robot motion planning around humans, automated stock trading, and robot manufacturing., Comment: 8 pages, 5 figures
- Published
- 2023
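The core mechanism in the entry above is calibrating a decision parameter directly from observed losses, without prediction sets or distributional assumptions. A minimal sketch of such an online controller is below; the pedestrian-buffer interpretation, the update rule's step size, and the class name are illustrative assumptions.

```python
# Hedged sketch of an online conformal-style controller: one scalar decision parameter
# (e.g., a safety buffer around predicted pedestrian positions) is adapted so that the
# long-run average loss tracks a target risk level. Illustrative, not the paper's code.
class ConformalController:
    def __init__(self, target_risk: float = 0.05, step_size: float = 0.1, lam0: float = 0.0):
        self.target_risk = target_risk   # epsilon: desired long-run risk
        self.step_size = step_size       # eta: adaptation rate
        self.lam = lam0                  # decision parameter (e.g., buffer radius in meters)

    def decide(self) -> float:
        """Return the decision parameter to use at this time step."""
        return self.lam

    def update(self, observed_loss: float) -> None:
        """Observe a bounded loss in [0, 1] after acting and adapt the parameter:
        grow it (more conservative) when losses exceed the target risk, shrink otherwise."""
        self.lam += self.step_size * (observed_loss - self.target_risk)

# usage at each control step:
#   buffer = controller.decide(); act with that buffer; controller.update(loss)
```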
37. General In-Hand Object Rotation with Vision and Touch
- Author
-
Qi, Haozhi, Yi, Brent, Suresh, Sudharshan, Lambeta, Mike, Ma, Yi, Calandra, Roberto, and Malik, Jitendra
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing., Comment: CoRL 2023; Website: https://haozhi.io/rotateit/
- Published
- 2023
38. Learning Vision-based Pursuit-Evasion Robot Policies
- Author
-
Bajcsy, Andrea, Loquercio, Antonio, Kumar, Ashish, and Malik, Jitendra
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
Learning strategic robot behavior -- like that required in pursuit-evasion interactions -- under real-world constraints is extremely challenging. It requires exploiting the dynamics of the interaction, and planning through both physical state and latent intent uncertainty. In this paper, we transform this intractable problem into a supervised learning problem, where a fully-observable robot policy generates supervision for a partially-observable one. We find that the quality of the supervision signal for the partially-observable pursuer policy depends on two key factors: the balance of diversity and optimality of the evader's behavior, and the strength of the modeling assumptions in the fully-observable policy. We deploy our policy on a physical quadruped robot with an RGB-D camera for pursuit-evasion interactions in the wild. Despite all the challenges, the sensing constraints bring about creativity: the robot is pushed to gather information when uncertain, predict intent from noisy measurements, and anticipate in order to intercept. Project webpage: https://abajcsy.github.io/vision-based-pursuit/, Comment: Includes Supplementary. Project webpage at https://abajcsy.github.io/vision-based-pursuit/
- Published
- 2023
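The central trick in the entry above is privileged supervision: a fully-observable policy labels actions for a partially-observable one. The sketch below shows the shape of that training loop; the environment interface, the MSE imitation loss, and the on-policy (DAgger-style) data collection are illustrative assumptions.

```python
# Hedged sketch of training a partially-observable student from a privileged teacher.
# The env API (returning both onboard observations and the full state) is hypothetical.
import torch

def supervise_student(student, teacher, env, optimizer, episodes=100):
    for _ in range(episodes):
        obs_partial, state_full = env.reset()        # onboard obs vs. privileged full state
        done = False
        while not done:
            with torch.no_grad():
                target_action = teacher(state_full)  # privileged policy provides the label
            pred_action = student(obs_partial)
            loss = ((pred_action - target_action) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # roll out with the student's own action so the data stays on-policy
            obs_partial, state_full, done = env.step(pred_action.detach())
    return student
```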
39. EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
- Author
-
Mangalam, Karttikeya, Akshulakov, Raiymbek, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5,000 human-curated multiple-choice question-answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks and datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multiple-choice question-answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and Zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io, Comment: https://egoschema.github.io/
- Published
- 2023
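The "temporal certificate" notion in the entry above reduces to a simple quantity: the total duration of the minimal set of sub-clips a verifier must watch to confirm an answer. A small sketch of that computation is below; the interval representation is an illustrative assumption about how annotators would mark the certificate.

```python
# Hedged sketch: certificate length = total duration of the minimal verifying sub-clips.
def certificate_length(intervals):
    """intervals: list of (start_sec, end_sec) sub-clips marked as necessary by an annotator."""
    merged = []
    for s, e in sorted(intervals):
        # merge overlapping sub-clips so shared footage is not double counted
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return sum(e - s for s, e in merged)
```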
40. Learning Space-Time Semantic Correspondences
- Author
-
Tran, Du and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a new task of space-time semantic correspondence prediction in videos. Given a source video, a target video, and a set of space-time keypoints in the source video, the task requires predicting a set of keypoints in the target video that are the semantic correspondences of the provided source keypoints. We believe that this task is important for fine-grained video understanding, potentially enabling applications such as activity coaching, sports analysis, robot imitation learning, and more. Our contributions in this paper are: (i) proposing a new task and providing annotations for space-time semantic correspondences on two existing benchmarks: Penn Action and Pouring; and (ii) presenting a comprehensive set of baselines and experiments to gain insights about the new problem. Our main finding is that the space-time semantic correspondence prediction problem is best approached jointly in space and time rather than in their decomposed sub-problems: time alignment and spatial correspondences.
- Published
- 2023
41. Robot Learning with Sensorimotor Pre-training
- Author
-
Radosavovic, Ilija, Shi, Baifeng, Fu, Letian, Goldberg, Ken, Darrell, Trevor, and Malik, Jitendra
- Subjects
Computer Science - Robotics ,Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can predict the masked-out content it will have acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations which makes prediction tractable, enables scaling to larger models, and allows fast inference on a real robot. To evaluate our approach, we collected a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and grasping algorithms. We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots., Comment: CoRL 2023; Project page: https://robotic-pretrained-transformer.github.io
- Published
- 2023
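The entry above describes masking a subset of sensorimotor tokens and training a Transformer to regress the missing content. The PyTorch sketch below illustrates that masked-prediction objective; the token dimensions, masking ratio, and the omission of positional embeddings are illustrative simplifications, not the released RPT model.

```python
# Hedged sketch of masked sensorimotor pre-training: project image latents, proprioception,
# and actions into a shared token space, mask a random subset, and regress the originals.
import torch
import torch.nn as nn

class MaskedSensorimotorModel(nn.Module):
    def __init__(self, img_dim=768, proprio_dim=26, act_dim=8, d_model=256, n_layers=4):
        super().__init__()
        self.proj = nn.ModuleDict({"img": nn.Linear(img_dim, d_model),
                                   "proprio": nn.Linear(proprio_dim, d_model),
                                   "act": nn.Linear(act_dim, d_model)})
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.heads = nn.ModuleDict({"img": nn.Linear(d_model, img_dim),
                                    "proprio": nn.Linear(d_model, proprio_dim),
                                    "act": nn.Linear(d_model, act_dim)})

    def forward(self, img, proprio, act, mask_ratio=0.75):
        # img: (B, T, img_dim), proprio: (B, T, proprio_dim), act: (B, T, act_dim)
        # positional embeddings are omitted here for brevity
        B, T = img.shape[:2]
        tokens = torch.cat([self.proj["img"](img),
                            self.proj["proprio"](proprio),
                            self.proj["act"](act)], dim=1)              # (B, 3T, d_model)
        mask = torch.rand(B, 3 * T, device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        h = self.encoder(tokens)
        pred = {"img": self.heads["img"](h[:, :T]),
                "proprio": self.heads["proprio"](h[:, T:2 * T]),
                "act": self.heads["act"](h[:, 2 * T:])}
        targets = {"img": img, "proprio": proprio, "act": act}
        m = mask.view(B, 3, T)
        loss = 0.0
        for i, k in enumerate(["img", "proprio", "act"]):
            err = (pred[k] - targets[k]) ** 2                            # (B, T, dim_k)
            loss = loss + err[m[:, i]].mean()                            # only masked positions
        return loss
```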
42. Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
- Author
-
Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, Malik, Jitendra, Li, Yanghao, and Feichtenhofer, Christoph
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera., Comment: ICML 2023 Oral version. Code+Models: https://github.com/facebookresearch/hiera
- Published
- 2023
43. Humans in 4D: Reconstructing and Tracking Humans with Transformers
- Author
-
Goel, Shubham, Pavlakos, Georgios, Rajasegaran, Jathushan, Kanazawa, Angjoo, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D. This enables us to deal with multiple people and maintain identities through occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art results for tracking people from monocular video. Furthermore, we demonstrate the effectiveness of HMR 2.0 on the downstream task of action recognition, achieving significant improvements over previous pose-based action recognition approaches. Our code and models are available on the project website: https://shubham-goel.github.io/4dhumans/., Comment: In ICCV 2023. Project Webpage: https://shubham-goel.github.io/4dhumans/
- Published
- 2023
44. Manipulator as a Tail: Promoting Dynamic Stability for Legged Locomotion
- Author
-
Huang, Huang, Loquercio, Antonio, Kumar, Ashish, Thakkar, Neerja, Goldberg, Ken, and Malik, Jitendra
- Subjects
Computer Science - Robotics - Abstract
Is an arm on a legged robot a liability or an asset for locomotion? Biological systems evolved additional limbs beyond legs that facilitate postural control. This work shows how a manipulator can be an asset for legged locomotion at high speeds or under external perturbations, where the arm serves purposes beyond manipulation. Since the system has 15 degrees of freedom (twelve for the legged robot and three for the arm), off-the-shelf reinforcement learning (RL) algorithms struggle to learn effective locomotion policies. Inspired by Bernstein's neurophysiological theory of animal motor learning, we develop an incremental training procedure that initially freezes some degrees of freedom and gradually releases them, using behavior cloning (BC) from earlier learning stages to guide optimization in later ones. Simulation experiments show that our policy increases the success rate by up to 61 percentage points over the baselines. Simulation and real-robot experiments suggest that our policy learns to use the arm as a tail to initiate robot turning at high speeds and to stabilize the quadruped under external perturbations. Quantitatively, in simulation experiments, we cut the failure rate by up to 43.6% during high-speed turning and by up to 31.8% for the quadruped under external forces, compared to using a locked arm.
- Published
- 2023
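The freeze-then-release curriculum described in the entry above can be summarized in two pieces: a stage-dependent action mask and a BC term toward the earlier-stage policy. The sketch below is an illustrative reading of that procedure; the stage schedule, dimensions, and loss weight are assumptions, not the paper's exact training code.

```python
# Hedged sketch of incremental DoF release with BC regularization toward the earlier stage.
import torch

LEG_DOF, ARM_DOF = 12, 3   # 15 total degrees of freedom, as described in the entry

def apply_stage_mask(action: torch.Tensor, stage: int, nominal_arm: torch.Tensor) -> torch.Tensor:
    """Stage 0: arm joints held at a nominal pose. Stage 1+: all 15 DoFs are controlled."""
    if stage == 0:
        action = action.clone()
        action[..., LEG_DOF:] = nominal_arm    # overwrite the 3 arm joint targets
    return action

def stage_loss(rl_loss, policy, prev_policy, obs, bc_weight=0.5):
    """Later-stage objective: the RL loss plus behavior cloning toward the frozen earlier-stage policy."""
    with torch.no_grad():
        prev_action = prev_policy(obs)
    bc_loss = ((policy(obs) - prev_action) ** 2).mean()
    return rl_loss + bc_weight * bc_loss
```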
45. On the Benefits of 3D Pose and Tracking for Human Action Recognition
- Author
-
Rajasegaran, Jathushan, Pavlakos, Georgios, Kanazawa, Angjoo, Feichtenhofer, Christoph, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work, we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view of analyzing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stance allows us to use the tracklets of people to predict their actions. In this spirit, we first show the benefits of using 3D pose to infer actions and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model that fuses 3D pose and contextualized appearance over tracklets. With this approach, our method achieves state-of-the-art performance on the AVA v2.2 dataset in both the pose-only setting and the standard benchmark setting. When reasoning about the action using only pose cues, our pose model achieves a +10.0 mAP gain over the corresponding state of the art, while our fused model has a gain of +2.8 mAP over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART, Comment: CVPR2023 (project page: https://brjathu.github.io/LART)
- Published
- 2023
46. Navigating to Objects Specified by Images
- Author
-
Krantz, Jacob, Gervet, Theophile, Yadav, Karmesh, Wang, Austin, Paxton, Chris, Mottaghi, Roozbeh, Batra, Dhruv, Malik, Jitendra, Lee, Stefan, and Chaplot, Devendra Singh
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Robotics - Abstract
Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature matching and localize the goal instance by projecting matched features to a map. Each sub-task is solved using off-the-shelf components requiring zero fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy by 7x and a state-of-the-art ImageNav model by 2.3x (56% vs 25% success). We deploy this system to a mobile robot platform and demonstrate effective real-world performance, achieving an 88% success rate across a home and an office environment.
- Published
- 2023
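The goal-instance re-identification step in the entry above boils down to counting reliable feature matches between the egocentric view and the goal image. The sketch below uses classical ORB features from OpenCV as a stand-in for the learned matcher used in the system; the match-count and ratio thresholds are illustrative assumptions.

```python
# Hedged sketch of goal re-identification via feature matching (ORB as a stand-in matcher).
import cv2

def looks_like_goal(egocentric_bgr, goal_bgr, min_matches=40, ratio=0.75):
    """Return True if the current egocentric view appears to contain the goal instance."""
    orb = cv2.ORB_create(nfeatures=1000)
    gray1 = cv2.cvtColor(egocentric_bgr, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(goal_bgr, cv2.COLOR_BGR2GRAY)
    kp1, des1 = orb.detectAndCompute(gray1, None)
    kp2, des2 = orb.detectAndCompute(gray2, None)
    if des1 is None or des2 is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only unambiguous matches
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches
```

In the full pipeline, the matched keypoints would additionally be projected into the map to localize the goal; here only the detection decision is shown.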
47. Effect of phosphorus and bio-fertilizer on productivity, nutrient uptake and economics of pigeonpea (Cajanus cajan) + mungbean (Phaseolus radiatus) intercropping system
- Author
-
Singh, Ravindra, Malik, Jitendra Kumar, Thenua, O.V.S., and Jat, H.S.
- Published
- 2013
48. Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
- Author
-
Majumdar, Arjun, Yadav, Karmesh, Arnaud, Sergio, Ma, Yecheng Jason, Chen, Claire, Silwal, Sneha, Jain, Aryan, Berges, Vincent-Pierre, Abbeel, Pieter, Malik, Jitendra, Batra, Dhruv, Lin, Yixin, Maksymets, Oleksandr, Rajeswaran, Aravind, and Meier, Franziska
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Robotics - Abstract
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that, in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10,000 GPU-hours to train) for the benefit of the research community., Comment: Project website: https://eai-vc.github.io
- Published
- 2023
49. Real-World Humanoid Locomotion with Reinforcement Learning
- Author
-
Radosavovic, Ilija, Xiao, Tete, Zhang, Bike, Darrell, Trevor, Malik, Jitendra, and Sreenath, Koushil
- Subjects
Computer Science - Robotics ,Computer Science - Machine Learning - Abstract
Humanoid robots that can autonomously operate in diverse environments have the potential to help address labor shortages in factories, assist the elderly at home, and colonize new planets. While classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesize that the observation-action history contains useful information about the world that a powerful transformer model can use to adapt its behavior in-context, without updating its weights. We train our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deploy it to the real world zero-shot. Our controller can walk over various outdoor terrains, is robust to external disturbances, and can adapt in context., Comment: Project page: https://learning-humanoid-locomotion.github.io
- Published
- 2023
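The architecture described in the entry above is a causal transformer over an interleaved observation-action history that outputs the next action. The sketch below shows that control interface in minimal form; the dimensions, context length, and interleaving scheme are illustrative assumptions, not the released model.

```python
# Hedged sketch of a causal-transformer locomotion policy over observation-action history.
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    def __init__(self, obs_dim=47, act_dim=19, d_model=192, n_layers=4, context=16):
        super().__init__()
        self.obs_in = nn.Linear(obs_dim, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(2 * context, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.act_out = nn.Linear(d_model, act_dim)

    def forward(self, obs_hist, act_hist):
        # obs_hist: (B, T, obs_dim); act_hist: (B, T, act_dim), where the last action slot
        # is a placeholder (e.g., zeros) -- the causal mask keeps it from being attended to.
        B, T, _ = obs_hist.shape
        tokens = torch.stack([self.obs_in(obs_hist), self.act_in(act_hist)], dim=2)
        tokens = tokens.reshape(B, 2 * T, -1) + self.pos[: 2 * T]    # o_1 a_1 ... o_T a_T
        causal = torch.full((2 * T, 2 * T), float("-inf"), device=tokens.device).triu(1)
        h = self.backbone(tokens, mask=causal)
        return self.act_out(h[:, -2])   # next action read off the latest observation token
```

Because the policy conditions on the whole history rather than a single state, it can, as the abstract argues, adapt its behavior in-context without weight updates.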
50. Decoupling Human and Camera Motion from Videos in the Wild
- Author
-
Ye, Vickie, Pavlakos, Georgios, Malik, Jitendra, and Kanazawa, Angjoo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories. Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on the EgoBody 3D human dataset. We further demonstrate that our recovered camera scale allows us to reason about the motion of multiple people in a shared coordinate frame, which improves the performance of downstream tracking on PoseTrack. Code and video results can be found at https://vye16.github.io/slahmr., Comment: Project site: https://vye16.github.io/slahmr. CVPR 2023
- Published
- 2023
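The key claim in the entry above is that human motion priors can pin down the unknown scale of up-to-scale SLAM camera trajectories. The sketch below is a heavily simplified illustration of that idea, using a crude walking-speed prior in place of the learned motion priors and full SMPL trajectory optimization of the actual method; all variable names and the prior are assumptions.

```python
# Heavily simplified, hedged sketch: optimize a single scene scale so that the implied
# world-frame root velocities of a tracked person match a plausible walking-speed prior.
import torch

def estimate_scale(cam_R, cam_t, root_in_cam, prior_speed=1.2, fps=30.0, n_steps=500):
    # cam_R: (T, 3, 3) world-from-camera rotations from SLAM
    # cam_t: (T, 3) camera translations, known only up to scale
    # root_in_cam: (T, 3) per-frame person root positions in camera coordinates (metric)
    log_s = torch.zeros(1, requires_grad=True)        # optimize log-scale to keep it positive
    opt = torch.optim.Adam([log_s], lr=0.05)
    for _ in range(n_steps):
        s = log_s.exp()
        root_world = torch.einsum("tij,tj->ti", cam_R, root_in_cam) + s * cam_t
        speed = (root_world[1:] - root_world[:-1]).norm(dim=-1) * fps
        loss = ((speed - prior_speed) ** 2).mean()    # prefer plausible human speeds
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_s.exp().item()
```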