Author: "Silvio Savarese" / Topic: 02 engineering and technology - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Silvio Savarese"' showing total 72 results

1. JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments

Author: Eric H. Frankel, JunYoung Gwak, Amir Sadeghian, Mihir Patel, Roberto Martín-Martín, Silvio Savarese, Hamid Rezatofighi, and Abhijeet Shenoi
Subjects: FOS: Computer and information sciences, Visual perception, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, 02 engineering and technology, Computer Science - Robotics, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, Social robot, Audio signal, Mobile manipulator, business.industry, Applied Mathematics, Computational Theory and Mathematics, Robot, RGB color model, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO), Encoder, Software
Abstract: We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360$^\circ$ RGB video at 15 fps, 3D point clouds from two Velodyne 16 Lidars, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360$^\circ$ spherical image from a fisheye camera and encoder values from the robot's wheels. Our dataset incorporates data from traditionally underrepresented scenes such as indoor environments and pedestrian areas, all from the ego-perspective of the robot, both stationary and navigating. The dataset has been annotated with over 2.3 million bounding boxes spread over 5 individual cameras and 1.8 million associated 3D cuboids around all people in the scenes totaling over 3500 time consistent trajectories. Together with our dataset and the annotations, we launch a benchmark and metrics for 2D and 3D person detection and tracking. With this dataset, which we plan on extending with further types of annotation in the future, we hope to provide a new source of data and a test-bench for research in the areas of egocentric robot vision, autonomous navigation, and all perceptual tasks around social robotics in human environments.
Published: 2023

2. Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks

Author: Peter A. Zachares, Michelle A. Lee, Matthew Tan, Animesh Garg, Krishnan Srinivasan, Jeannette Bohg, Silvio Savarese, Yuke Zhu, and Li Fei-Fei
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science - Machine Learning, 0209 industrial biotechnology, Computer science, 02 engineering and technology, Machine Learning (cs.LG), Computer Science Applications, Visualization, Computer Science::Robotics, Computer Science - Robotics, Task (computing), 020901 industrial engineering & automation, Control and Systems Engineering, Human–computer interaction, Task analysis, Reinforcement learning, Robot, Electrical and Electronic Engineering, Representation (mathematics), Robotics (cs.RO), Feature learning, Haptic technology
Abstract: Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. It is non-trivial to manually design a robot controller that combines these modalities which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. In this work, we use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while being robust to external perturbations. We also systematically study different self-supervised learning objectives and representation learning architectures. Results are presented in simulation and on a physical robot. Comment: arXiv admin note: substantial text overlap with arXiv:1810.10191
Published: 2020

3. Improving Social Awareness Through DANTE: Deep Affinity Network for Clustering Conversational Interactants

Author: Sydney Thompson, Mason Swofford, Nathan Tsoi, Roberto Martín-Martín, Marynel Vázquez, John Charles Peruzzi, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Networks and Communications, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Group detection, 02 engineering and technology, Machine learning, computer.software_genre, 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, Social consciousness, Cluster analysis, 050107 human factors, Clustering coefficient, Group (mathematics), business.industry, Deep learning, 05 social sciences, Social environment, Human-Computer Interaction, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Social Sciences (miscellaneous)
Abstract: We propose a data-driven approach to detect conversational groups by identifying spatial arrangements typical of these focused social encounters. Our approach uses a novel Deep Affinity Network (DANTE) to predict the likelihood that two individuals in a scene are part of the same conversational group, considering their social context. The predicted pair-wise affinities are then used in a graph clustering framework to identify both small (e.g., dyads) and large groups. The results from our evaluation on multiple, established benchmarks suggest that combining powerful deep learning methods with classical clustering techniques can improve the detection of conversational groups in comparison to prior approaches. Finally, we demonstrate the practicality of our approach in a human-robot interaction scenario. Our efforts show that our work advances group detection not only in theory, but also in practice.
Published: 2020

4. Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter

Author: Silvio Savarese, Animesh Garg, Roberto Martín-Martín, Rohun Kulkarni, Andrey Kurenkov, Marcus Dominguez-Kuhne, and Joseph Taglic
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Extremely hard, business.industry, Computer Science - Artificial Intelligence, Sample (statistics), 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Outcome (probability), Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence (cs.AI), Clutter, Reinforcement learning, Computer vision, Artificial intelligence, business, Robotics (cs.RO), 0105 earth and related environmental sciences, Heap (data structure)
Abstract: When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way and fully reveal the object of interest and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcomes is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution to map observations (e.g. images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.
Published: 2020

5. JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset

Author: Patrick Goebel, Roberto Martín-Martín, Abhijeet Shenoi, JunYoung Gwak, Mihir Patel, Hamid Rezatofighi, Amir Sadeghian, and Silvio Savarese
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Artificial neural network, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Point cloud, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Kalman filter, Tracking (particle physics), Computer Science - Robotics, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), RGB color model, Robot, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Robotics (cs.RO)
Abstract: Robots navigating autonomously need to perceive and track the motion of objects and other agents in their surroundings. This information enables planning and executing robust and safe trajectories. To facilitate these processes, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research has focused on tracking people and moving objects in 2D RGB video sequences. In this work we present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time, state-of-the-art tracking performance. Our system is built with recent neural networks for re-identification, 2D and 3D detection and track description, combined into a joint probabilistic data-association framework within a multi-modal recursive Kalman architecture. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark, annotated with over 2 million boxes and 3500 time consistent 2D+3D trajectories across 54 indoor and outdoor scenes. JRDB contains over 60 minutes of data including 360 degree cylindrical RGB video and 3D pointclouds in social settings that we use to develop, train and evaluate JRMOT. The presented 3D MOT system demonstrates state-of-the-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as the first 3D tracking solution for our benchmark. Real-robot tests on our social robot JackRabbot indicate that the system is capable of tracking multiple pedestrians fast and reliably. We provide the ROS code of our tracker at https://sites.google.com/view/jrmot . Comment: 8 pages, 5 figures, 2 tables; accepted at IROS 2020
Published: 2020

6. Multimodal Sensor Fusion with Differentiable Filters

Author: Jeannette Bohg, Brent Yi, Michelle A. Lee, Silvio Savarese, and Roberto Martín-Martín
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Noise (signal processing), Computer science, 02 engineering and technology, Filter (signal processing), Sensor fusion, Computer Science - Robotics, 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Differentiable function, Algorithm, Robotics (cs.RO), Interpretability
Abstract: Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their noise profile can be difficult or lead to intractable models. Differentiable filters provide a way to learn these models end-to-end while retaining the algorithmic structure of recursive filters. This can be especially helpful when working with sensor modalities that are high dimensional and have very different characteristics. In contact-rich manipulation, we want to combine visual sensing (which gives us global information) with tactile sensing (which gives us local information). In this paper, we study new differentiable filtering architectures to fuse heterogeneous sensor information. As case studies, we evaluate three tasks: two in planar pushing (simulated and real) and one in manipulating a kinematically constrained door (simulated). In extensive evaluations, we find that differentiable filters that leverage crossmodal sensor information reach comparable accuracies to unstructured LSTM models, while presenting interpretability benefits that may be important for safety-critical systems. We also release an open-source library for creating and training differentiable Bayesian filters in PyTorch, which can be found on our project website: https://sites.google.com/view/multimodalfilter . Comment: Published in IROS 2020. Updated sponsors, fixed Kalman gain typo
Published: 2020

7. IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

Author: Dieter Fox, Animesh Garg, Byron Boots, Silvio Savarese, Ajay Mandlekar, Li Fei-Fei, and Fabio Ramos
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Computer science, Computer Science - Artificial Intelligence, 02 engineering and technology, 010501 environmental sciences, Crowdsourcing, Machine learning, computer.software_genre, 01 natural sciences, Task (project management), Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Control theory, Reinforcement, 0105 earth and related environmental sciences, business.industry, Robotics, Artificial Intelligence (cs.AI), Offline learning, Task analysis, Robot, Artificial intelligence, business, computer, Robotics (cs.RO)
Abstract: Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater variations. The data can exhibit substantial diversity and consist of suboptimal solution approaches. In this paper, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework for learning from large-scale demonstration datasets. IRIS factorizes the control problem into a goal-conditioned low-level controller that imitates short demonstration sequences and a high-level goal selection mechanism that sets goals for the low-level and selectively combines parts of suboptimal solutions leading to more successful task completions. We evaluate IRIS across three datasets, including the RoboTurk Cans dataset collected by humans via crowdsourcing, and show that performant policies can be learned from purely offline learning. Additional results at https://sites.google.com/stanford.edu/iris/ .
Published: 2019

8. Scaling Robot Supervision to Hundreds of Hours with RoboTurk: Robotic Manipulation Dataset through Human Reasoning and Dexterity

Author: Anchit Gupta, Jonathan Booher, Animesh Garg, Max Spero, Yuke Zhu, Li Fei-Fei, Silvio Savarese, Ajay Mandlekar, and Albert Tung
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, 0209 industrial biotechnology, Data collection, Computer science, Computer Science - Human-Computer Interaction, 02 engineering and technology, Human-Computer Interaction (cs.HC), Machine Learning (cs.LG), Set (abstract data type), Computer Science - Robotics, Task (computing), 020901 industrial engineering & automation, Human–computer interaction, Data quality, 0202 electrical engineering, electronic engineering, information engineering, Robot, Leverage (statistics), 020201 artificial intelligence & image processing, Robotics (cs.RO)
Abstract: Large, richly annotated datasets have accelerated progress in fields such as computer vision and natural language processing, but replicating these successes in robotics has been challenging. While prior data collection methodologies such as self-supervision have resulted in large datasets, the data can have poor signal-to-noise ratio. By contrast, previous efforts to collect task demonstrations with humans provide better quality data, but they cannot reach the same data magnitude. Furthermore, neither approach places guarantees on the diversity of the data collected, in terms of solution strategies. In this work, we leverage and extend the RoboTurk platform to scale up data collection for robotic manipulation using remote teleoperation. The primary motivation for our platform is two-fold: (1) to address the shortcomings of prior work and increase the total quantity of manipulation data collected through human supervision by an order of magnitude without sacrificing the quality of the data and (2) to collect data on challenging manipulation tasks across several operators and observe a diverse set of emergent behaviors and solutions. We collected over 111 hours of robot manipulation data across 54 users and 3 challenging manipulation tasks in 1 week, resulting in the largest robot dataset collected via remote teleoperation. We evaluate the quality of our platform, the diversity of demonstrations in our dataset, and the utility of our dataset via quantitative and qualitative analysis. For additional results, supplementary videos, and to download our dataset, visit http://roboturk.stanford.edu/realrobotdataset . Comment: Published at IROS 2019
Published: 2019

9. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Author: Amir Roshan Zamir, JunYoung Gwak, Zhi-Yang He, Iro Armeni, Silvio Savarese, Martin Fischer, and Jitendra Malik
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Graph, Visualization, Computer Science - Robotics, 3d space, Framing (construction), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Scene graph, Polygon mesh, Computer vision, Artificial intelligence, business, Robotics (cs.RO)
Abstract: A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations. Comment: ICCV 2019
Published: 2019

10. Continuous Relaxation of Symbolic Planner for One-Shot Imitation Learning

Author: De-An Huang, Animesh Garg, Silvio Savarese, Juan Carlos Niebles, Danfei Xu, Yuke Zhu, and Li Fei-Fei
Subjects: FOS: Computer and information sciences, One shot, Computer Science - Machine Learning, Computer science, business.industry, Computer Science - Artificial Intelligence, Probabilistic logic, 02 engineering and technology, 010501 environmental sciences, Imitation learning, Planner, 01 natural sciences, Machine Learning (cs.LG), Computer Science - Robotics, Artificial Intelligence (cs.AI), Symbol grounding, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO), computer, 0105 earth and related environmental sciences, computer.programming_language
Abstract: We address one-shot imitation learning, where the goal is to execute a previously unseen task based on a single demonstration. While there has been exciting progress in this direction, most of the approaches still require a few hundred tasks for meta-training, which limits the scalability of the approaches. Our main contribution is to formulate one-shot imitation learning as a symbolic planning problem along with the symbol grounding problem. This formulation disentangles the policy execution from the inter-task generalization and leads to better data efficiency. The key technical challenge is that the symbol grounding is prone to error with limited training data and leads to subsequent symbolic planning failures. We address this challenge by proposing a continuous relaxation of the discrete symbolic planner that directly plans on the probabilistic outputs of the symbol grounding model. Our continuous relaxation of the planner can still leverage the information contained in the probabilistic symbol grounding and significantly improve over the baseline planner for the one-shot imitation learning tasks without using large training data. Comment: IROS 2019
Published: 2019

11. Taskonomy: Disentangling Task Transfer Learning

Author: Alexander Sax, William B. Shen, Amir Roshan Zamir, Jitendra Malik, Leonidas J. Guibas, and Silvio Savarese
Subjects: Structure (mathematical logic), Exploit, Computer science, business.industry, 0211 other engineering and technologies, 02 engineering and technology, Space (commercial competition), Solver, Machine learning, computer.software_genre, 030218 nuclear medicine & medical imaging, 03 medical and health sciences, Task (computing), 0302 clinical medicine, Use case, Artificial intelligence, Set (psychology), Transfer of learning, business, computer, 021101 geological & geomatics engineering
Abstract: Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.
Published: 2019

12. Variable Impedance Control in End-Effector Space: An Action Space for Reinforcement Learning in Contact-Rich Tasks

Author: Silvio Savarese, Animesh Garg, Michelle A. Lee, Rachel Gardner, Jeannette Bohg, and Roberto Martín-Martín
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Computer science, Computer Science - Artificial Intelligence, Control engineering, 02 engineering and technology, Kinematics, Robot end effector, Machine Learning (cs.LG), law.invention, Task (project management), 03 medical and health sciences, Computer Science - Robotics, Artificial Intelligence (cs.AI), 020901 industrial engineering & automation, 0302 clinical medicine, Action (philosophy), Impedance control, Robustness (computer science), law, Robot, Reinforcement learning, Robotics (cs.RO), 030217 neurology & neurosurgery
Abstract: Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in recent years. While many studies in RL focus on varying the observation space or reward model, few efforts have focused on the choice of action space (e.g. joint or end-effector space, position, velocity, etc.). However, studies in robot motion control indicate that choosing an action space that conforms to the characteristics of the task can simplify exploration and improve robustness to disturbances. This paper studies the effect of different action spaces in deep RL and advocates for Variable Impedance Control in End-effector Space (VICES) as an advantageous action space for constrained and contact-rich tasks. We evaluate multiple action spaces on three prototypical manipulation tasks: Path Following (task with no contact), Door Opening (task with kinematic constraints), and Surface Wiping (task with continuous contact). We show that VICES improves sample efficiency, maintains low energy consumption, and ensures safety across all three experimental setups. Further, RL policies learned with VICES can transfer across different robot models in simulation, and from simulation to real for the same robot. Further information is available at https://stanfordvl.github.io/vices . Comment: IROS 2019
Published: 2019

13. TopNet: Structural Point Cloud Decoder

Author: Vineet Kosaraju, Silvio Savarese, Hamid Rezatofighi, Lyne P. Tchapmi, and Ian Reid
Subjects: Structure (mathematical logic), Theoretical computer science, business.industry, Computer science, Point cloud, 020207 software engineering, Cloud computing, 02 engineering and technology, Point group, Object (computer science), Manifold, Set (abstract data type), Tree structure, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Geometric primitive, Artificial intelligence, business
Abstract: 3D point cloud generation is of great use for 3D scene modeling and understanding. Real-world 3D object point clouds can be properly described by a collection of low-level and high-level structures such as surfaces, geometric primitives, semantic parts, etc. In fact, there exist many different representations of a 3D object point cloud as a set of point groups. Existing frameworks for point cloud generation either do not consider structure in their proposed solutions, or assume and enforce a specific structure/topology, e.g. a collection of manifolds or surfaces, for the generated point cloud of a 3D object. In this work, we propose a novel decoder that generates a structured point cloud without assuming any specific structure or topology on the underlying point set. Our decoder is softly constrained to generate a point cloud following a hierarchical rooted tree structure. We show that given enough capacity and allowing for redundancies, the proposed decoder is very flexible and able to learn any arbitrary grouping of points including any topology on the point set. We evaluate our decoder on the task of point cloud generation for 3D point cloud shape completion. Combined with encoders from existing frameworks, we show that our proposed decoder significantly outperforms state-of-the-art 3D point cloud completion methods on the ShapeNet dataset.
Published: 2019

14. Deep Local Trajectory Replanning and Control for Robot Navigation

Author: Roberto Martín-Martín, Silvio Savarese, Patrick Goebel, Junwei Yang, Hans M. Ewald, Marynel Vázquez, Amir Sadeghian, Dorsa Sadigh, Vincent Chow, Ashwini Pokle, and Zhenkai Wang
Subjects: Structure (mathematical logic), FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, Computer Science - Artificial Intelligence, Real-time computing, Control (management), Navigation system, 020207 software engineering, 02 engineering and technology, Plan (drawing), Motion (physics), Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence (cs.AI), 0202 electrical engineering, electronic engineering, information engineering, Trajectory, Robot, Robotics (cs.RO)
Abstract: We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system's execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.
Published: 2019

15. Mechanical Search: Multi-Step Retrieval of a Target Object Occluded by Clutter

Author: Roberto Martín-Martín, Andrey Kurenkov, Ken Goldberg, Animesh Garg, Silvio Savarese, Ashwin Balakrishna, David Wang, Michael Danielczuk, and Matthew Matl
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, GRASP, 02 engineering and technology, Image segmentation, Visualization, Computer Science - Robotics, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Robot, Clutter, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Robotics (cs.RO), Heap (data structure)
Abstract: When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions -- push, suction, grasp -- until the target object is extracted, or either a time limit is exceeded, or no high confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search . Comment: To appear in IEEE International Conference on Robotics and Automation (ICRA), 2019. 9 pages with 4 figures
Published: 2019

16. Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Author: Silvio Savarese, Alexander Toshev, Kuan Fang, and Li Fei-Fei
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), 02 engineering and technology, Memorization, Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Embodied cognition, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Reinforcement learning, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO), Transformer (machine learning model)
Abstract: Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The proposed policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. This model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a margin. Comment: CVPR 2019 paper with supplementary material
Published: 2019

17. 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

Author: Li Fei-Fei, Chen Wang, Jun Lv, Danfei Xu, Yuke Zhu, Silvio Savarese, Cewu Lu, and Roberto Martín-Martín
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, Deep learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Inter frame, 02 engineering and technology, Visualization, Computer Science - Robotics, 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, RGB color model, Robot, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Pose, Robotics (cs.RO)
Abstract: We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real-time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.
Published: 2019
Full Text: View/download PDF

18. Situational Fusion of Visual Representation for Visual Navigation

Author: Yuke Zhu, Leonidas J. Guibas, Li Fei-Fei, Danfei Xu, William B. Shen, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Visual perception, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Representation (systemics), 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Task (project management), Visualization, Action (philosophy), Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, 020201 artificial intelligence & image processing, Artificial intelligence, Vanishing point, business, 0105 earth and related environmental sciences
Abstract: A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities. For example, to "go to the nearest chair", the agent might need to identify a chair in a living room using semantics, follow along a hallway using vanishing point cues, and avoid obstacles using depth. Therefore, utilizing the appropriate visual perception abilities based on a situational understanding of the visual environment can empower these navigation models in unseen visual environments. We propose to train an agent to fuse a large set of visual representations that correspond to diverse visual perception abilities. To fully utilize each representation, we develop an action-level representation fusion scheme, which predicts an action candidate from each representation and adaptively consolidates these action candidates into the final action. Furthermore, we employ a data-driven inter-task affinity regularization to reduce redundancies and improve generalization. Our approach leads to significantly improved performance in novel environments over an ImageNet-pretrained baseline and other fusion methods.
Published: 2019
Full Text: View/download PDF

19. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Author: Roberto Martín-Martín, Li Fei-Fei, Yuke Zhu, Silvio Savarese, Danfei Xu, Cewu Lu, and Chen Wang
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, Deep learning, Computer Vision and Pattern Recognition (cs.CV), GRASP, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Object (computer science), Computer Science - Robotics, 020901 industrial engineering & automation, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Robot, RGB color model, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Pose, Robotics (cs.RO)
Abstract: A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method on a real robot to grasp and manipulate objects based on the estimated pose.
Published: 2019
Full Text: View/download PDF

20. Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings

Author: Angel X. Chang, Thomas Funkhouser, Christopher Choy, Manolis Savva, Kevin Chen, and Silvio Savarese
Subjects: business.industry, Computer science, Association (object-oriented programming), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, Texture (music), 01 natural sciences, Colored, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Embedding, Artificial intelligence, business, Representation (mathematics), Joint (audio engineering), Natural language, ComputingMethodologies_COMPUTERGRAPHICS, 0105 earth and related environmental sciences
Abstract: We present a method for generating colored 3D shapes from natural language. To this end, we first learn joint embeddings of freeform text descriptions and colored 3D shapes. Our model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and physical properties of 3D shapes such as color and shape. To evaluate our approach, we collect a large dataset of natural language descriptions for physical 3D objects in the ShapeNet dataset. With this learned joint embedding we demonstrate text-to-shape retrieval that outperforms baseline approaches. Using our embeddings with a novel conditional Wasserstein GAN framework, we generate colored 3D shapes from text. Our method is the first to connect natural language text with realistic 3D objects exhibiting rich variations in color, texture, and shape detail.
Published: 2019

21. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

Author: Silvio Savarese, JunYoung Gwak, and Christopher Choy
Subjects: Conditional random field, FOS: Computer and information sciences, Computer science, business.industry, Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, Image segmentation, Convolutional neural network, Convolution, Artificial Intelligence (cs.AI), Margin (machine learning), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Segmentation, Tensor, Hybrid kernel, Noise (video), Artificial intelligence, business
Abstract: In many robotics and VR/AR applications, 3D-videos are readily-available sources of input (a continuous sequence of depth images, or LIDAR scans). However, those 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution that encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field that enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks and are faster than the 3D counterpart in some cases. Comment: CVPR'19
Published: 2019
Full Text: View/download PDF

22. KETO: Learning Keypoint Representations for Tool Manipulation

Author: Silvio Savarese, Yuke Zhu, Li Fei-Fei, Kuan Fang, and Zengyi Qin
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Artificial neural network, business.industry, Computer science, 05 social sciences, 02 engineering and technology, Object (computer science), Machine learning, computer.software_genre, 050105 experimental psychology, Visualization, Task (project management), Computer Science - Robotics, 020901 industrial engineering & automation, Task analysis, Robot, 0501 psychology and cognitive sciences, Artificial intelligence, business, Set (psychology), Representation (mathematics), computer, Robotics (cs.RO)
Abstract: We aim to develop an algorithm for robots to manipulate novel objects as tools for completing different task goals. An efficient and informative representation would facilitate the effectiveness and generalization of such algorithms. For this purpose, we present KETO, a framework of learning keypoint representations of tool-based manipulation. For each task, a set of task-specific keypoints is jointly predicted from 3D point clouds of the tool object by a deep neural network. These keypoints offer a concise and informative description of the object to determine grasps and subsequent manipulation actions. The model is learned from self-supervised robot interactions in the task environment without the need for explicit human annotations. We evaluate our framework in three manipulation tasks with tool use. Our model consistently outperforms state-of-the-art methods in terms of task success rates. Qualitative results of keypoint prediction and tool generation are shown to visualize the learned representations.
Published: 2019
Full Text: View/download PDF

23. Machine vision for natural gas methane emissions detection using an infrared camera

Author: Jingfan Wang, Mike McGuire, Lyne P. Tchapmi, Daniel Zimmerle, Silvio Savarese, Adam R. Brandt, Arvind P. Ravikumar, and Clay S. Bell
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Leak, Offset (computer science), Computer science, Machine vision, Computer Vision and Pattern Recognition (cs.CV), 020209 energy, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Management, Monitoring, Policy and Law, Convolutional neural network, Methane, Machine Learning (cs.LG), chemistry.chemical_compound, 020401 chemical engineering, Natural gas, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, 0204 chemical engineering, Background subtraction, business.industry, Mechanical Engineering, Deep learning, Image and Video Processing (eess.IV), Building and Construction, Electrical Engineering and Systems Science - Image and Video Processing, General Energy, chemistry, Artificial intelligence, business
Abstract: It is crucial to reduce natural gas methane emissions, which can potentially offset the climate benefits of replacing coal with gas. Optical gas imaging (OGI) is a widely used method to detect methane leaks, but it is labor-intensive and cannot provide leak detection results without operators' judgment. In this paper, we develop a computer vision approach to OGI-based leak detection using convolutional neural networks (CNN) trained on methane leak images to enable automatic detection. First, we collect ~1 M frames of labeled video of methane leaks from different leaking equipment to build the CNN model, covering a wide range of leak sizes (5.3-2051.6 gCH4/h) and imaging distances (4.6-15.6 m). Second, we examine different background subtraction methods to extract the methane plume in the foreground. Third, we test three CNN model variants, collectively called GasNet, to detect plumes in videos taken at other pieces of leaking equipment. We assess the ability of GasNet to perform leak detection by comparing it to a baseline method that uses an optical-flow-based change detection algorithm. We explore the sensitivity of results to the CNN structure, with a moderate-complexity variant performing best across distances. We find that the detection accuracy can reach as high as 99%, and the overall detection accuracy can exceed 95% for a case across all leak sizes and imaging distances. Binary detection accuracy exceeds 97% for large leaks (~710 gCH4/h) imaged closely (~5-7 m). At closer imaging distances (~5-10 m), CNN-based models have greater than 94% accuracy across all leak sizes. At the farthest distances (~13-16 m), performance degrades rapidly, but it can achieve above 95% accuracy in detecting large leaks (>950 gCH4/h). The GasNet-based computer vision approach could be deployed in OGI surveys to allow automatic vigilance of methane leak detection with high detection accuracy in the real world. Comment: This paper was submitted to Applied Energy
Published: 2020

24. GONet: A Semi-Supervised Deep Learning Approach For Traversability Estimation

Author: Amir Sadeghian, Silvio Savarese, Noriaki Hirose, Marynel Vázquez, and Patrick Goebel
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Traverse, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Computer Science - Computer Vision and Pattern Recognition, Mobile robot, 02 engineering and technology, Machine learning, computer.software_genre, Machine Learning (cs.LG), Computer Science - Robotics, Computer Science - Learning, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO), computer
Abstract: We present semi-supervised deep learning approaches for traversability estimation from fisheye images. Our method, GONet, and the proposed extensions leverage Generative Adversarial Networks (GANs) to effectively predict whether the area seen in the input image(s) is safe for a robot to traverse. These methods are trained with many positive images of traversable places, but just a small set of negative images depicting blocked and unsafe areas. This makes the proposed methods practical. Positive examples can be collected easily by simply operating a robot through traversable spaces, while obtaining negative examples is time consuming, costly, and potentially dangerous. Through extensive experiments and several demonstrations, we show that the proposed traversability estimation approaches are robust and can generalize to unseen scenarios. Further, we demonstrate that our methods are memory efficient and fast, allowing for real-time operation on a mobile robot with single or stereo fisheye cameras. As part of our contributions, we open-source two new datasets for traversability estimation. These datasets are composed of approximately 24h of videos from more than 25 indoor environments. Our methods outperform baseline approaches for traversability estimation on these new datasets. Comment: 8 pages, 7 figures, 3 tables
Published: 2018

25. VUNet: Dynamic Scene View Synthesis for Traversability Estimation using an RGB Camera

Author: Amir Sadeghian, Noriaki Hirose, Roberto Martín-Martín, Silvio Savarese, and Fei Xia
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Control and Optimization, Computer science, Computer Vision and Pattern Recognition (cs.CV), Biomedical Engineering, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence, Computer vision, 0105 earth and related environmental sciences, Network architecture, business.industry, Mechanical Engineering, Deep learning, Mobile robot, Computer Science Applications, View synthesis, Human-Computer Interaction, Control and Systems Engineering, Virtual image, Teleoperation, Robot, RGB color model, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO)
Abstract: We present VUNet, a novel view (VU) synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability. Our method predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps. The future images result from applying two types of image changes to the previous and current images: 1) changes caused by different camera pose, and 2) changes due to the motion of the dynamic obstacles. We learn to predict these two types of changes disjointly using two novel network architectures, SNet and DNet. We combine SNet and DNet to synthesize future images that we pass to our previously presented method GONet to estimate the traversable areas around the robot. Our quantitative and qualitative evaluations indicate that our approach for view synthesis predicts accurate future images in both static and dynamic environments. We also show that these virtual images can be used to estimate future traversability correctly. We apply our view synthesis-based traversability estimation method to two applications for assisted teleoperation. Website: http://svl.stanford.edu/projects/vunet/
Published: 2018

26. Demo2Vec: Reasoning Object Affordances from Online Videos

Author: Silvio Savarese, Te-Lin Wu, Joseph J. Lim, Kuan Fang, and Daniel Yang
Subjects: Computer science, business.industry, 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Recurrent neural network, Action (philosophy), Feature (computer vision), Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Affordance, 0105 earth and related environmental sciences
Abstract: Watching expert demonstrations is an important way for humans and robots to reason about affordances of unseen objects. In this paper, we consider the problem of reasoning object affordances through the feature embedding of demonstration videos. We design the Demo2Vec model which learns to extract embedded vectors of demonstration videos and predicts the interaction region and the action label on a target image of the same object. We introduce the Online Product Review dataset for Affordance (OPRA) by collecting and labeling diverse YouTube product review videos. Our Demo2Vec model outperforms various recurrent neural network baselines on the collected dataset.
Published: 2018

27. Adversarial Feature Augmentation for Unsupervised Domain Adaptation

Author: Riccardo Volpi, Silvio Savarese, Vittorio Murino, and Pietro Morerio
Subjects: FOS: Computer and information sciences, Artificial neural network, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature vector, Deep learning, Feature extraction, Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, Minimax, 01 natural sciences, Image (mathematics), Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, 0105 earth and related environmental sciences, Generator (mathematics)
Abstract: Recent works showed that Generative Adversarial Networks (GANs) can be successfully applied in unsupervised domain adaptation, where, given a labeled source dataset and an unlabeled target dataset, the goal is to train powerful classifiers for the target samples. In particular, it was shown that a GAN objective function can be used to learn target features indistinguishable from the source ones. In this work, we extend this framework by (i) forcing the learned feature extractor to be domain-invariant, and (ii) training it through data augmentation in the feature space, namely performing feature augmentation. While data augmentation in the image space is a well established technique in deep learning, feature augmentation has not yet received the same level of attention. We accomplish it by means of a feature generator trained by playing the GAN minimax game against source features. Results show that both enforcing domain-invariance and performing feature augmentation lead to superior or comparable performance to state-of-the-art results in several unsupervised domain adaptation benchmarks. Comment: Accepted to CVPR 2018
Published: 2018

28. Gibson Env: Real-World Perception for Embodied Agents

Author: Zhi-Yang He, Amir Roshan Zamir, Jitendra Malik, Fei Xia, Silvio Savarese, and Alexander Sax
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, 0209 industrial biotechnology, Visual perception, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), media_common.quotation_subject, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Space (commercial competition), Machine Learning (cs.LG), Computer Science - Robotics, Computer Science - Graphics, 020901 industrial engineering & automation, Human–computer interaction, Perception, 0202 electrical engineering, electronic engineering, information engineering, Set (psychology), media_common, Artificial neural network, business.industry, Graphics (cs.GR), Visualization, Artificial Intelligence (cs.AI), Embodied cognition, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO)
Abstract: Developing visual perception models for active agents and sensorimotor control is cumbersome in the physical world, as existing algorithms are too slow to learn efficiently in real time and robots are fragile and costly. This has given rise to learning-in-simulation, which in turn raises the question of whether the results transfer to the real world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", enabling deployment of the trained models in the real world without needing further domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space. Comment: Access the code, dataset, and project website at http://gibsonenv.vision/. CVPR 2018
Published: 2018

29. Taskonomy: Disentangling Task Transfer Learning

Author: Leonidas J. Guibas, Jitendra Malik, Alexander Sax, William B. Shen, Amir Roshan Zamir, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, Reuse, Machine learning, computer.software_genre, 01 natural sciences, Machine Learning (cs.LG), Computer Science - Robotics, 0202 electrical engineering, electronic engineering, information engineering, Neural and Evolutionary Computing (cs.NE), 0105 earth and related environmental sciences, business.industry, Computer Science - Neural and Evolutionary Computing, Solver, Computer Science - Learning, Artificial Intelligence (cs.AI), Task analysis, Labeled data, 020201 artificial intelligence & image processing, Artificial intelligence, Transfer of learning, business, Robotics (cs.RO), computer, Intuition
Abstract: Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying the existence of a structure among visual tasks. Knowing this structure has notable value; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases. Comment: CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision/
Published: 2018

30. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks

Author: Li Fei-Fei, Alexandre Alahi, Silvio Savarese, Agrim Gupta, and Justin Johnson
Subjects: Trajectory prediction, FOS: Computer and information sciences, 050210 logistics & transportation, Social robot, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Deep learning, 05 social sciences, Pooling, Generative models, Computer Science - Computer Vision and Pattern Recognition, Mobile robot, 02 engineering and technology, Motion (physics), Variety (cybernetics), Motion estimation, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, 020201 artificial intelligence & image processing, Artificial intelligence, Forecasting models, business
Abstract: Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments. This is challenging because human motion is inherently multimodal: given a history of human motion paths, there are many socially plausible ways that people could move in the future. We tackle this problem by combining tools from sequence prediction and generative adversarial networks: a recurrent sequence-to-sequence model observes motion histories and predicts future behavior, using a novel pooling mechanism to aggregate information across people. We predict socially plausible futures by training adversarially against a recurrent discriminator, and encourage diverse predictions with a novel variety loss. Through experiments on several datasets we demonstrate that our approach outperforms prior work in terms of accuracy, variety, collision avoidance, and computational complexity.
Published: 2018

31. Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View

Author: Manolis Savva, Angel X. Chang, Silvio Savarese, Andy Zeng, Thomas Funkhouser, and Shuran Song
Subjects: Pixel, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, Context (language use), 02 engineering and technology, 010501 environmental sciences, Semantics, 01 natural sciences, Convolutional neural network, Consistency (database systems), Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, Probability distribution, RGB color model, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation (≤ 50%) in the form of an RGB-D image. To make this possible, Im2Pano3D leverages strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To ease the prediction of 3D structure, we propose to parameterize 3D surfaces with their plane equations and train the model to predict these parameters directly. To provide meaningful training supervision, we use multiple loss functions that consider both pixel-level accuracy and global context consistency. Experiments demonstrate that Im2Pano3D is able to predict the semantics and 3D structure of the unobserved scene with more than 56% pixel accuracy and less than 0.52m average distance error, which is significantly better than alternative approaches.
Published: 2018

32. Deep Learning under Privileged Information Using Heteroscedastic Dropout

Author: John Lambert, Ozan Sener, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Contextual image classification, Computer science, business.industry, Deep learning, Machine Learning (stat.ML), 02 engineering and technology, Variance (accounting), 010501 environmental sciences, 01 natural sciences, Convolutional neural network, Generalization error, Machine Learning (cs.LG), Support vector machine, Computer Science - Learning, Recurrent neural network, Margin (machine learning), Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Dropout (neural networks), 0105 earth and related environmental sciences
Abstract: Unlike machines, humans learn through rapid, abstract model-building. The role of a teacher is not simply to hammer home right or wrong answers, but rather to provide intuitive comments, comparisons, and explanations to a pupil. This is what the Learning Under Privileged Information (LUPI) paradigm endeavors to model by utilizing extra knowledge only available during training. We propose a new LUPI algorithm specifically designed for Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We propose to use a heteroscedastic dropout (i.e. dropout with a varying variance) and make the variance of the dropout a function of privileged information. Intuitively, this corresponds to using the privileged information to control the uncertainty of the model output. We perform experiments using CNNs and RNNs for the tasks of image classification and machine translation. Our method significantly increases the sample efficiency during learning, resulting in higher accuracy with a large margin when the number of training examples is limited. We also theoretically justify the gains in sample efficiency by providing a generalization error bound decreasing with $O(\frac{1}{n})$, where $n$ is the number of training examples, in an oracle case., CVPR 2018
Published: 2018

33. Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

Author: Danfei Xu, Silvio Savarese, Suraj Nair, Julian Gao, Animesh Garg, Yuke Zhu, and Li Fei-Fei
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Artificial Intelligence, Computer science, Generalization, business.industry, Semantics (computer science), 02 engineering and technology, Robot learning, Machine Learning (cs.LG), Data modeling, Computer Science - Learning, Computer Science - Robotics, Task (computing), Artificial Intelligence (cs.AI), 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO)
Abstract: In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks. NTP achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures. The experimental results show that NTP learns to generalize well towards unseen tasks with increasing lengths, variable topologies, and changing objectives., Comment: ICRA 2018
Published: 2018

34. Behavioral Indoor Navigation With Natural Language Directions

Author: Alvaro Soto, Xiaoxue Zang, Silvio Savarese, Marynel Vázquez, and Juan Carlos Niebles
Subjects: Structure (mathematical logic), 0209 industrial biotechnology, Sequence, Computer science, 05 social sciences, Geometric representation, 02 engineering and technology, 050105 experimental psychology, Human–robot interaction, Nonverbal communication, 020901 industrial engineering & automation, Human–computer interaction, Robot, 0501 psychology and cognitive sciences, Natural language
Abstract: We describe a behavioral navigation approach that leverages the rich semantic structure of human environments to enable robots to navigate without an explicit geometric representation of the world. Based on this approach, we then present our efforts to allow robots to follow navigation instructions in natural language. With our proof-of-concept implementation, we were able to translate natural language navigation commands into a sequence of behaviors that could then be executed by a robot to reach a desired goal.
Published: 2018

35. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

Author: Animesh Garg, Li Fei-Fei, Krishnan Srinivasan, Parth Shah, Michelle A. Lee, Yuke Zhu, Jeannette Bohg, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, 0209 industrial biotechnology, Computer Science - Artificial Intelligence, Computer science, 010401 analytical chemistry, 02 engineering and technology, 01 natural sciences, 0104 chemical sciences, Task (project management), Visualization, Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence (cs.AI), Human–computer interaction, Task analysis, Robot, Reinforcement learning, Representation (mathematics), Robotics (cs.RO), Haptic technology
Abstract: Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometry, configurations, and clearances, while being robust to external perturbations. Results for simulated and real robot experiments are presented., Comment: ICRA 2019
Published: 2018
Full Text: View/download PDF

36. SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints

Author: Amir Sadeghian, Hamid Rezatofighi, Silvio Savarese, Ali Sadeghian, Vineet Kosaraju, and Noriaki Hirose
Subjects: FOS: Computer and information sciences, Social robot, Computer science, business.industry, Multi-agent system, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Context (language use), 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Physical information, Component (UML), Path (graph theory), 0202 electrical engineering, electronic engineering, information engineering, Trajectory, 020201 artificial intelligence & image processing, Motion planning, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: This paper addresses the problem of path prediction for multiple interacting agents in a scene, which is a crucial step for many autonomous platforms such as self-driving cars and social robots. We present SoPhie, an interpretable framework based on a Generative Adversarial Network (GAN), which leverages two sources of information: the path history of all the agents in a scene, and the scene context information, using images of the scene. To predict a future path for an agent, both physical and social information must be leveraged. Previous work has not succeeded in jointly modeling physical and social interactions. Our approach blends a social attention mechanism with a physical attention mechanism. The physical attention helps the model learn where to look in a large scene and extract the most salient parts of the image relevant to the path, whereas the social attention component aggregates information across the different agent interactions and extracts the most important trajectory information from the surrounding neighbors. SoPhie also takes advantage of the GAN to generate more realistic samples and to capture the uncertain nature of future paths by modeling their distribution. All these mechanisms enable our approach to predict socially and physically plausible paths for the agents and to achieve state-of-the-art performance on several different trajectory forecasting benchmarks.
Published: 2018
Full Text: View/download PDF

37. Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration

Author: De-An Huang, Yuke Zhu, Li Fei-Fei, Animesh Garg, Suraj Nair, Juan Carlos Niebles, Danfei Xu, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer science, Generalization, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Visual reasoning, 010501 environmental sciences, 01 natural sciences, Graph, Machine Learning (cs.LG), Computer Science - Robotics, Artificial Intelligence (cs.AI), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO), 0105 earth and related environmental sciences
Abstract: Our goal is to generate a policy to complete an unseen task given just a single video demonstration of the task in a given domain. We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model. To this end, we propose Neural Task Graph (NTG) Networks, which use a conjugate task graph as the intermediate representation to modularize both the video demonstration and the derived policy. We empirically show that NTG achieves inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. NTG improves data efficiency with visual input and achieves strong generalization without the need for dense hierarchical supervision. We further show that similar performance trends hold when applied to real-world data. We show that NTG can effectively predict task structure on the JIGSAWS surgical dataset and generalize to unseen tasks., Comment: CVPR 2019
Published: 2018
Full Text: View/download PDF

38. Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation

Author: Kevin Chen, Juan Carlos Niebles, Xiaoxue Zang, Alvaro Soto, Silvio Savarese, Ashwini Pokle, and Marynel Vázquez
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, business.industry, Computer science, Computer Science - Artificial Intelligence, Deep learning, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Artificial Intelligence (cs.AI), Knowledge base, Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Robot, Leverage (statistics), 020201 artificial intelligence & image processing, Artificial intelligence, business, Computation and Language (cs.CL), Natural language, Reflection mapping, 0105 earth and related environmental sciences
Abstract: We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation. We use attention models to connect information from both the user instructions and a topological representation of the environment. We evaluate our model’s performance on a new dataset containing 10,050 pairs of navigation instructions. Our model significantly outperforms baseline approaches. Furthermore, our results suggest that it is possible to leverage the environment map as a relevant knowledge base to facilitate the translation of free-form navigational instruction.
Published: 2018
Full Text: View/download PDF

39. Learning Task-Oriented Grasping for Tool Manipulation from Simulated Self-Supervision

Author: Silvio Savarese, Viraj Mehta, Kuan Fang, Animesh Garg, Andrey Kurenkov, Li Fei-Fei, and Yuke Zhu
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), 02 engineering and technology, 021001 nanoscience & nanotechnology, Task (project management), Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Self supervision, Human–computer interaction, Statistics - Machine Learning, Task oriented, Robot, 0210 nano-technology, Robotics (cs.RO)
Abstract: Tool manipulation is vital for facilitating robots to complete challenging task goals. It requires reasoning about the desired effect of the task and thus properly grasping and manipulating the tool to achieve the task. Task-agnostic grasping optimizes for grasp robustness while ignoring crucial task-specific constraints. In this paper, we propose the Task-Oriented Grasping Network (TOG-Net) to jointly optimize both task-oriented grasping of a tool and the manipulation policy for that tool. The training process of the model is based on large-scale simulated self-supervision with procedurally generated tool objects. We perform both simulated and real-world experiments on two tool-based manipulation tasks: sweeping and hammering. Our model achieves overall 71.1% task success rate for sweeping and 80.0% task success rate for hammering. Supplementary material is available at: bit.ly/task-oriented-grasp, Comment: RSS 2018
Published: 2018
Full Text: View/download PDF

40. Robust real-time tracking combining 3D shape, color, and motion

Author: Jesse Levinson, David Held, Silvio Savarese, and Sebastian Thrun
Subjects: 0209 industrial biotechnology, business.industry, Applied Mathematics, Mechanical Engineering, Posterior probability, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Object (computer science), Tracking (particle physics), Motion (physics), Tracking error, 020901 industrial engineering & automation, Artificial Intelligence, Laser tracker, Robustness (computer science), Modeling and Simulation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Real time tracking, Software, Mathematics
Abstract: Real-time tracking algorithms often suffer from low accuracy and poor robustness when confronted with difficult, real-world data. We present a tracker that combines 3D shape, color (when available), and motion cues to accurately track moving objects in real-time. Our tracker allocates computational effort based on the shape of the posterior distribution. Starting with a coarse approximation to the posterior, the tracker successively refines this distribution, increasing in tracking accuracy over time. The tracker can thus be run for any amount of time, after which the current approximation to the posterior is returned. Even at a minimum runtime of 0.37 ms per object, our method outperforms all of the baseline methods of similar speed by at least 25% in root-mean-square (RMS) tracking error. If our tracker is allowed to run for longer, the accuracy continues to improve, and it continues to outperform all baseline methods. Our tracker is thus anytime, allowing the speed or accuracy to be optimized based on the needs of the application. By combining 3D shape, color (when available), and motion cues in a probabilistic framework, our tracker is able to robustly handle changes in viewpoint, occlusions, and lighting variations for moving objects of a variety of shapes, sizes, and distances.
Published: 2015

41. Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos

Author: Silvio Savarese, Ozan Sener, and Rachel Luo
Subjects: Data stream, 0209 industrial biotechnology, Computer science, business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Semantic property, Observer (special relativity), 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, RGB color model, Leverage (statistics), 020201 artificial intelligence & image processing, Segmentation, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: In this paper we focus on the problem of inferring geometric and semantic properties of a complex scene where humans interact with objects from egocentric views. Unlike most previous work, our goal is to leverage a multimodal sensory stream composed of RGB, depth, and thermal (RGB-D-T) signals and use this data stream as an input to a new framework for joint 6 DOF camera localization, 3D reconstruction, and semantic segmentation. As our extensive experimental evaluation shows, the combination of different sensing modalities allows us to achieve greater robustness in situations where both the observer and the objects in the scene move rapidly (a challenging situation for traditional semantic reconstruction methods). Moreover, we contribute a new dataset that includes a large number of egocentric RGB-D-T videos of humans performing daily real-world activities as well as a new demonstration hardware platform for acquiring such a dataset.
Published: 2017

42. Weakly Supervised 3D Reconstruction with Adversarial Constraint

Author: Manmohan Chandraker, Animesh Garg, Silvio Savarese, JunYoung Gwak, and Christopher Choy
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), 3D reconstruction, Pooling, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, CAD, 02 engineering and technology, Real image, Machine learning, computer.software_genre, Backpropagation, law.invention, Constraint (information theory), law, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Scale (map), business, Manifold (fluid mechanics), computer
Abstract: Supervised 3D reconstruction has witnessed significant progress through the use of deep neural networks. However, this increase in performance requires large-scale annotations of 2D/3D data. In this paper, we explore inexpensive 2D supervision as an alternative to expensive 3D CAD annotation. Specifically, we use foreground masks as weak supervision through a raytrace pooling layer that enables perspective projection and backpropagation. Additionally, since 3D reconstruction from masks is an ill-posed problem, we propose to constrain the 3D reconstruction to the manifold of unlabeled realistic 3D shapes that match mask observations. We demonstrate that learning a log-barrier solution to this constrained optimization problem resembles the GAN objective, enabling the use of existing tools for training GANs. We evaluate and analyze the manifold constrained reconstruction on various datasets for single and multi-view reconstruction of both synthetic and real images.
Published: 2017

43. SEGCloud: Semantic Segmentation of 3D Point Clouds

Author: Iro Armeni, Silvio Savarese, JunYoung Gwak, Christopher Choy, and Lyne P. Tchapmi
Subjects: FOS: Computer and information sciences, Conditional random field, Artificial neural network, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Trilinear interpolation, Point cloud, 020207 software engineering, Pattern recognition, 02 engineering and technology, computer.software_genre, Voxel, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Segmentation, Artificial intelligence, Differentiable function, business, computer
Abstract: 3D semantic scene labeling is fundamental to agents operating in the real world. In particular, labeling raw 3D point sets from sensors provides fine-grained semantics. Recent works leverage the capabilities of Neural Networks (NNs), but are limited to coarse voxel predictions and do not explicitly enforce global consistency. We present SEGCloud, an end-to-end framework to obtain 3D point-level segmentation that combines the advantages of NNs, trilinear interpolation (TI) and fully connected Conditional Random Fields (FC-CRF). Coarse voxel predictions from a 3D Fully Convolutional NN are transferred back to the raw 3D points via trilinear interpolation. Then the FC-CRF enforces global consistency and provides fine-grained semantics on the points. We implement the latter as a differentiable Recurrent NN to allow joint optimization. We evaluate the framework on two indoor and two outdoor 3D datasets (NYU V2, S3DIS, KITTI, Semantic3D.net), and show performance comparable or superior to the state-of-the-art on all datasets., Comment: Accepted as a spotlight at the International Conference of 3D Vision (3DV 2017)
Published: 2017

44. Lattice Long Short-Term Memory for Human Action Recognition

Author: Kevin Chen, Bertram E. Shi, Dit-Yan Yeung, Kui Jia, Lin Sun, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Optical flow, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Recurrent neural network, Gesture recognition, Lattice (order), 0202 electrical engineering, electronic engineering, information engineering, Action recognition, RGB color model, 020201 artificial intelligence & image processing, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities., Comment: ICCV2017
Published: 2017

45. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies

Author: Alexandre Alahi, Amir Sadeghian, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 020207 software engineering, 02 engineering and technology, Sensor fusion, Tracking (particle physics), Object detection, Term (time), Recurrent neural network, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: The majority of existing solutions to the Multi-Target Tracking (MTT) problem do not combine cues in a coherent end-to-end fashion over a long period of time. However, we present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, we present a structure of Recurrent Neural Networks (RNN) that jointly reasons on multiple cues over a temporal window. We are able to correct many data association errors and recover observations from an occluded state. We demonstrate the robustness of our data-driven approach by tracking multiple targets using their appearance, motion, and even interactions. Our method outperforms previous works on multiple publicly available datasets including the challenging MOT benchmark.
Published: 2017

46. Adversarially Robust Policy Learning: Active construction of physically-plausible perturbations

Author: Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese
Subjects: 0209 industrial biotechnology, Computer science, business.industry, Computation, 02 engineering and technology, Machine learning, computer.software_genre, Domain (software engineering), 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Reinforcement learning, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, State (computer science), Resilience (network), business, computer, Vulnerability (computing)
Abstract: Policy search methods in reinforcement learning have demonstrated success in scaling up to larger problems beyond toy examples. However, deploying these methods on real robots remains challenging due to the large sample complexity required during learning and their vulnerability to malicious intervention. We introduce Adversarially Robust Policy Learning (ARPL), an algorithm that leverages active computation of physically-plausible adversarial examples during training to enable robust policy learning in the source domain and robust performance under both random and adversarial input perturbations. We evaluate ARPL on four continuous control tasks and show superior resilience to changes in physical environment dynamics parameters and environment state as compared to state-of-the-art robust policy learning methods. Code, data, and additional experimental results are available at: stanfordvl.github.io/ARPL
Published: 2017

47. Deep View Morphing

Author: Junghyun Kwon, Max E. McFarland, Dinghuang Ji, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Pixel, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Visibility (geometry), Interpolation (computer graphics), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, View synthesis, Morphing, Rectification, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, 0105 earth and related environmental sciences, Interpolation
Abstract: Recently, convolutional neural networks (CNN) have been successfully applied to view synthesis problems. However, such CNN-based methods can suffer from lack of texture details, shape distortions, or high computational complexity. In this paper, we propose a novel CNN architecture for view synthesis called "Deep View Morphing" that does not suffer from these issues. To synthesize a middle view of two input images, a rectification network first rectifies the two input images. An encoder-decoder network then generates dense correspondences between the rectified images and blending masks to predict the visibility of pixels of the rectified images in the middle view. A view morphing network finally synthesizes the middle view using the dense correspondences and blending masks. We experimentally show the proposed method significantly outperforms the state-of-the-art CNN-based view synthesis method., Accepted to CVPR 2017
Published: 2017

48. Unsupervised camera localization in crowded spaces

Author: Silvio Savarese, Li Fei-Fei, Alexandre Alahi, and Judson Wilson
Subjects: Matching (statistics), Social robot, Optimization problem, business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020206 networking & telecommunications, 02 engineering and technology, Tracking (particle physics), Motion (physics), 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Existing camera networks in public spaces such as train terminals or malls can help social robots to navigate crowded scenes. However, the localization of the cameras is required, i.e., the positions and poses of all cameras in a unique reference. In this work, we estimate the relative location of any pair of cameras by solely using noisy trajectories observed from each camera. We propose a fully unsupervised learning technique using unlabelled pedestrian motion patterns captured in crowded scenes. We first estimate the pairwise camera parameters by optimally matching single-view pedestrian tracks using social awareness. Then, we show the impact of jointly estimating the network parameters. This is done by formulating a nonlinear least-squares optimization problem, leveraging a continuous approximation of the matching function. We evaluate our approach in real-world environments such as train terminals, where several hundreds of individuals need to be tracked across dozens of cameras every second.
Published: 2017

49. Watch-n-Patch: Unsupervised Learning of Actions and Relations

Author: Chenxia Wu, Ashutosh Saxena, Bart Selman, Silvio Savarese, Jiemi Zhang, and Ozan Sener
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Variation (game tree), Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence, Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Cluster analysis, Set (psychology), Hidden Markov model, business.industry, Applied Mathematics, Computer Science - Learning, Computational Theory and Mathematics, Action (philosophy), Unsupervised learning, Robot, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO), Software
Abstract: There is a large variation in the activities that humans perform in their everyday lives. We consider modeling these composite human activities, which comprise multiple basic-level actions, in a completely unsupervised setting. Our model learns high-level co-occurrence and temporal relations between the actions. We consider the video as a sequence of short-term action clips, which contain human-words and object-words. An activity is about a set of action-topics and object-topics indicating which actions are present and which objects are being interacted with. We then propose a new probabilistic model relating the words and the topics. It allows us to model long-range action relations that commonly exist in the composite activities, which is challenging in previous works. We apply our model to unsupervised action segmentation and clustering, and to a novel application that detects forgotten actions, which we call action patching. For evaluation, we contribute a new challenging RGB-D activity video dataset recorded by the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacting with different objects. Moreover, we develop a robotic system that watches people and reminds people by applying our action patching algorithm. Our robotic setup can be easily deployed on any assistive robot. arXiv admin note: text overlap with arXiv:1512.04208
Published: 2017

50. Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection

Author: Wongun Choi, Silvio Savarese, Yu Xiang, and Yuanqing Lin
Subjects: FOS: Computer and information sciences, Subcategory, 050210 logistics & transportation, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), 05 social sciences, Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, 02 engineering and technology, Object (computer science), 3D pose estimation, Convolutional neural network, Object detection, Object-class detection, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Viola–Jones object detection framework, Computer vision, Artificial intelligence, business, Pose
Abstract: In CNN-based object detection methods, region proposal becomes a bottleneck when objects exhibit significant scale variation, occlusion or truncation. In addition, these methods mainly focus on 2D object detection and cannot estimate detailed properties of objects. In this paper, we propose subcategory-aware CNNs for object detection. We introduce a novel region proposal network that uses subcategory information to guide the proposal generating process, and a new detection network for joint detection and subcategory classification. By using subcategories related to object pose, we achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks. Comment: Published in WACV 2017
Published: 2017
