Author: "Silvio Savarese" / Publisher: ieee - Searchworks@Jio Institute Digital Library Search Results

1. Topological Planning with Transformers for Vision-and-Language Navigation

Author: Junshen K. Chen, Jo Chuang, Kevin Chen, Marynel Vázquez, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Backtracking, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Robotics, Plan (drawing), Modular design, Topology, Computer Science - Robotics, Artificial Intelligence (cs.AI), Control theory, Topological map, Artificial intelligence, business, Robotics (cs.RO), Computation and Language (cs.CL), Natural language, Transformer (machine learning model)
Abstract: Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
Published: 2021
Full Text: View/download PDF

2. ReLMoGen: Integrating Motion Generation in Reinforcement Learning for Mobile Manipulation

Author: Or Litany, Alexander Toshev, Roberto Martín-Martín, Fei Xia, Chengshu Li, and Silvio Savarese
Subjects: business.industry, Computer science, Trajectory, Robot, Reinforcement learning, Robotics, Artificial intelligence, Set (psychology), business, Executor, Motion (physics), Generator (mathematics)
Abstract: Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen – a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art RL and Hierarchical RL baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots. For more information, please visit project website: http://svl.stanford.edu/projects/relmogen.
Published: 2021
Full Text: View/download PDF

3. Learning Multi-Arm Manipulation Through Collaborative Teleoperation

Author: Roberto Martín-Martín, Josiah Wong, Silvio Savarese, Yuke Zhu, Li Fei-Fei, Ajay Mandlekar, and Albert Tung
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Robot kinematics, Data collection, Computer Science - Artificial Intelligence, Computer science, Machine Learning (cs.LG), Computer Science - Robotics, Artificial Intelligence (cs.AI), Human–computer interaction, Teleoperation, Benchmark (computing), Robot, Set (psychology), Robotics (cs.RO), Robotic arm, Desk
Abstract: Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging -- asking a human to control more than one robotic arm can impose significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present Multi-Arm RoboTurk (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout their full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk.stanford.edu/multiarm. First two authors contributed equally.
Published: 2021
Full Text: View/download PDF

4. Robot Navigation in Constrained Pedestrian Environments using Reinforcement Learning

Author: Can Liu, Claudia Perez-D'Arpino, Patrick Goebel, Roberto Martín-Martín, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Artificial Intelligence, Computer science, Social navigation, Mobile robot, Pedestrian, Computer Science - Robotics, Artificial Intelligence (cs.AI), Human–computer interaction, Scalability, Trajectory, Robot, Reinforcement learning, Adaptation (computer science), Robotics (cs.RO)
Abstract: Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as buildings and homes. While research on social navigation has focused mainly on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors and doorways that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning (RL) to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas RL handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to more complex unseen layouts that exhibit composition of the structural elements available during training. Going beyond walls-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments. These results support the applicability of the compositional principle to navigation in real-world buildings and indicate promising usage of multi-agent simulation within reconstructed environments for tasks that involve interaction. https://ai.stanford.edu/~cdarpino/socialnavconstrained/
Published: 2021
Full Text: View/download PDF

5. Localizing Against Drawn Maps via Spline-Based Registration

Author: Silvio Savarese, Marynel Vázquez, and Kevin Chen
Subjects: Computer science, business.industry, 05 social sciences, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, Task (project management), Spline (mathematics), Lidar, 0502 economics and business, Robot, Computer vision, Artificial intelligence, 050207 economics, business, Thin plate spline, Baseline (configuration management), Rigid transformation, ComputingMethodologies_COMPUTERGRAPHICS, 0105 earth and related environmental sciences
Abstract: We propose a method to facilitate robot navigation relative to sketched maps of human environments. Our main contribution centers around using thin plate splines for registering the robot’s LIDAR observation with the hand-drawn maps. Thin plate splines are particularly effective for this task because they are able to handle many of the nonrigid deformations commonly seen in sketches of maps, which render traditional rigid transformations inappropriate. Our proposed approach uses a convolutional neural network to efficiently predict the control points which define the spline transform, from which we then compute the pose of the robot on the hand drawn map for navigation purposes. Our systematic evaluations in simulation using a synthetic dataset and real, hand-drawn sketches show that the proposed spline-based registration approach outperforms baseline methods.
Published: 2020
Full Text: View/download PDF

6. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Author: Amir Roshan Zamir, JunYoung Gwak, Zhi-Yang He, Iro Armeni, Silvio Savarese, Martin Fischer, and Jitendra Malik
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Graph, Visualization, Computer Science - Robotics, 3d space, Framing (construction), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Scene graph, Polygon mesh, Computer vision, Artificial intelligence, business, Robotics (cs.RO)
Abstract: A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations., Comment: ICCV 2019
Published: 2019
Full Text: View/download PDF

7. TopNet: Structural Point Cloud Decoder

Author: Vineet Kosaraju, Silvio Savarese, Hamid Rezatofighi, Lyne P. Tchapmi, and Ian Reid
Subjects: Structure (mathematical logic), Theoretical computer science, business.industry, Computer science, Point cloud, 020207 software engineering, Cloud computing, 02 engineering and technology, Point group, Object (computer science), Manifold, Set (abstract data type), Tree structure, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Geometric primitive, Artificial intelligence, business
Abstract: 3D point cloud generation is of great use for 3D scene modeling and understanding. Real-world 3D object point clouds can be properly described by a collection of low-level and high-level structures such as surfaces, geometric primitives, semantic parts, etc. In fact, there exist many different representations of a 3D object point cloud as a set of point groups. Existing frameworks for point cloud generation either do not consider structure in their proposed solutions, or assume and enforce a specific structure/topology, e.g. a collection of manifolds or surfaces, for the generated point cloud of a 3D object. In this work, we propose a novel decoder that generates a structured point cloud without assuming any specific structure or topology on the underlying point set. Our decoder is softly constrained to generate a point cloud following a hierarchical rooted tree structure. We show that given enough capacity and allowing for redundancies, the proposed decoder is very flexible and able to learn any arbitrary grouping of points including any topology on the point set. We evaluate our decoder on the task of point cloud generation for 3D point cloud shape completion. Combined with encoders from existing frameworks, we show that our proposed decoder significantly outperforms state-of-the-art 3D point cloud completion methods on the ShapeNet dataset.
Published: 2019
Full Text: View/download PDF

8. Mechanical Search: Multi-Step Retrieval of a Target Object Occluded by Clutter

Author: Roberto Martín-Martín, Andrey Kurenkov, Ken Goldberg, Animesh Garg, Silvio Savarese, Ashwin Balakrishna, David Wang, Michael Danielczuk, and Matthew Matl
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, GRASP, 02 engineering and technology, Image segmentation, Visualization, Computer Science - Robotics, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Robot, Clutter, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Robotics (cs.RO), Heap (data structure)
Abstract: When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions -- push, suction, grasp -- until the target object is extracted, or either a time limit is exceeded, or no high confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search ., Comment: To appear in IEEE International Conference on Robotics and Automation (ICRA), 2019. 9 pages with 4 figures
Published: 2019
Full Text: View/download PDF

9. GONet: A Semi-Supervised Deep Learning Approach For Traversability Estimation

Author: Amir Sadeghian, Silvio Savarese, Noriaki Hirose, Marynel Vázquez, and Patrick Goebel
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Traverse, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Computer Science - Computer Vision and Pattern Recognition, Mobile robot, 02 engineering and technology, Machine learning, computer.software_genre, Machine Learning (cs.LG), Computer Science - Robotics, Computer Science - Learning, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO), computer
Abstract: We present semi-supervised deep learning approaches for traversability estimation from fisheye images. Our method, GONet, and the proposed extensions leverage Generative Adversarial Networks (GANs) to effectively predict whether the area seen in the input image(s) is safe for a robot to traverse. These methods are trained with many positive images of traversable places, but just a small set of negative images depicting blocked and unsafe areas. This makes the proposed methods practical. Positive examples can be collected easily by simply operating a robot through traversable spaces, while obtaining negative examples is time consuming, costly, and potentially dangerous. Through extensive experiments and several demonstrations, we show that the proposed traversability estimation approaches are robust and can generalize to unseen scenarios. Further, we demonstrate that our methods are memory efficient and fast, allowing for real-time operation on a mobile robot with single or stereo fisheye cameras. As part of our contributions, we open-source two new datasets for traversability estimation. These datasets are composed of approximately 24h of videos from more than 25 indoor environments. Our methods outperform baseline approaches for traversability estimation on these new datasets., Comment: 8 pages, 7 figures, 3 tables
Published: 2018
Full Text: View/download PDF

10. Demo2Vec: Reasoning Object Affordances from Online Videos

Author: Silvio Savarese, Te-Lin Wu, Joseph J. Lim, Kuan Fang, and Daniel Yang
Subjects: Computer science, business.industry, 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Recurrent neural network, Action (philosophy), Feature (computer vision), Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Affordance, 0105 earth and related environmental sciences
Abstract: Watching expert demonstrations is an important way for humans and robots to reason about affordances of unseen objects. In this paper, we consider the problem of reasoning object affordances through the feature embedding of demonstration videos. We design the Demo2Vec model which learns to extract embedded vectors of demonstration videos and predicts the interaction region and the action label on a target image of the same object. We introduce the Online Product Review dataset for Affordance (OPRA) by collecting and labeling diverse YouTube product review videos. Our Demo2Vec model outperforms various recurrent neural network baselines on the collected dataset.
Published: 2018
Full Text: View/download PDF

11. Gibson Env: Real-World Perception for Embodied Agents

Author: Zhi-Yang He, Amir Roshan Zamir, Jitendra Malik, Fei Xia, Silvio Savarese, and Alexander Sax
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, 0209 industrial biotechnology, Visual perception, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), media_common.quotation_subject, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Space (commercial competition), Machine Learning (cs.LG), Computer Science - Robotics, Computer Science - Graphics, 020901 industrial engineering & automation, Human–computer interaction, Perception, 0202 electrical engineering, electronic engineering, information engineering, Set (psychology), media_common, Artificial neural network, business.industry, Graphics (cs.GR), Visualization, Artificial Intelligence (cs.AI), Embodied cognition, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO)
Abstract: Developing visual perception models for active agents and sensorimotor control are cumbersome to be done in the physical world, as existing algorithms are too slow to efficiently learn in real-time and robots are fragile and costly. This has given rise to learning-in-simulation which consequently casts a question on whether the results transfer to real-world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real-world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", enabling deploying the trained models in real-world without needing further domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space., Comment: Access the code, dataset, and project website at http://gibsonenv.vision/ . CVPR 2018
Published: 2018
Full Text: View/download PDF

12. Adversarial Feature Augmentation for Unsupervised Domain Adaptation

Author: Riccardo Volpi, Silvio Savarese, Vittorio Murino, and Pietro Morerio
Subjects: FOS: Computer and information sciences, Artificial neural network, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature vector, Deep learning, Feature extraction, Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, Minimax, 01 natural sciences, Image (mathematics), Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, 0105 earth and related environmental sciences, Generator (mathematics)
Abstract: Recent works showed that Generative Adversarial Networks (GANs) can be successfully applied in unsupervised domain adaptation, where, given a labeled source dataset and an unlabeled target dataset, the goal is to train powerful classifiers for the target samples. In particular, it was shown that a GAN objective function can be used to learn target features indistinguishable from the source ones. In this work, we extend this framework by (i) forcing the learned feature extractor to be domain-invariant, and (ii) training it through data augmentation in the feature space, namely performing feature augmentation. While data augmentation in the image space is a well established technique in deep learning, feature augmentation has not yet received the same level of attention. We accomplish it by means of a feature generator trained by playing the GAN minimax game against source features. Results show that both enforcing domain-invariance and performing feature augmentation lead to superior or comparable performance to state-of-the-art results in several unsupervised domain adaptation benchmarks., Comment: Accepted to CVPR 2018
Published: 2018
Full Text: View/download PDF

13. Taskonomy: Disentangling Task Transfer Learning

Author: Leonidas J. Guibas, Jitendra Malik, Alexander Sax, William B. Shen, Amir Roshan Zamir, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, Reuse, Machine learning, computer.software_genre, 01 natural sciences, Machine Learning (cs.LG), Computer Science - Robotics, 0202 electrical engineering, electronic engineering, information engineering, Neural and Evolutionary Computing (cs.NE), 0105 earth and related environmental sciences, business.industry, Computer Science - Neural and Evolutionary Computing, Solver, Computer Science - Learning, Artificial Intelligence (cs.AI), Task analysis, Labeled data, 020201 artificial intelligence & image processing, Artificial intelligence, Transfer of learning, business, Robotics (cs.RO), computer, Intuition
Abstract: Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the structure of space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases., CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision/
Published: 2018
Full Text: View/download PDF

14. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks

Author: Li Fei-Fei, Alexandre Alahi, Silvio Savarese, Agrim Gupta, and Justin Johnson
Subjects: Trajectory prediction, FOS: Computer and information sciences, 050210 logistics & transportation, Social robot, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Deep learning, 05 social sciences, Pooling, Generative models, Computer Science - Computer Vision and Pattern Recognition, Mobile robot, 02 engineering and technology, Motion (physics), Variety (cybernetics), Motion estimation, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, 020201 artificial intelligence & image processing, Artificial intelligence, Forecasting models, business
Abstract: Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments. This is challenging because human motion is inherently multimodal: given a history of human motion paths, there are many socially plausible ways that people could move in the future. We tackle this problem by combining tools from sequence prediction and generative adversarial networks: a recurrent sequence-to-sequence model observes motion histories and predicts future behavior, using a novel pooling mechanism to aggregate information across people. We predict socially plausible futures by training adversarially against a recurrent discriminator, and encourage diverse predictions with a novel variety loss. Through experiments on several datasets we demonstrate that our approach outperforms prior work in terms of accuracy, variety, collision avoidance, and computational complexity.
Published: 2018
Full Text: View/download PDF

15. Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View

Author: Manolis Savva, Angel X. Chang, Silvio Savarese, Andy Zeng, Thomas Funkhouser, and Shuran Song
Subjects: Pixel, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, Context (language use), 02 engineering and technology, 010501 environmental sciences, Semantics, 01 natural sciences, Convolutional neural network, Consistency (database systems), Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, Probability distribution, RGB color model, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation (≤ 50%) in the form of an RGB-D image. To make this possible, Im2Pano3D leverages strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To ease the prediction of 3D structure, we propose to parameterize 3D surfaces with their plane equations and train the model to predict these parameters directly. To provide meaningful training supervision, we use multiple loss functions that consider both pixel level accuracy and global context consistency. Experiments demonstrate that Im2Pano3D is able to predict the semantics and 3D structure of the unobserved scene with more than 56% pixel accuracy and less than 0.52m average distance error, which is significantly better than alternative approaches.
Published: 2018
Full Text: View/download PDF

16. Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

Author: Danfei Xu, Silvio Savarese, Suraj Nair, Julian Gao, Animesh Garg, Yuke Zhu, and Li Fei-Fei
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Artificial Intelligence, Computer science, Generalization, business.industry, Semantics (computer science), 02 engineering and technology, Robot learning, Machine Learning (cs.LG), Data modeling, Computer Science - Learning, Computer Science - Robotics, Task (computing), Artificial Intelligence (cs.AI), 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, business, Robotics (cs.RO)
Abstract: In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks. NTP achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures. The experimental results show that NTP learns to generalize well towards unseen tasks with increasing lengths, variable topologies, and changing objectives., Comment: ICRA 2018
Published: 2018
Full Text: View/download PDF

17. Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos

Author: Silvio Savarese, Ozan Sener, and Rachel Luo
Subjects: Data stream, 0209 industrial biotechnology, Computer science, business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Semantic property, Observer (special relativity), 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, RGB color model, Leverage (statistics), 020201 artificial intelligence & image processing, Segmentation, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: In this paper we focus on the problem of inferring geometric and semantic properties of a complex scene where humans interact with objects from egocentric views. Unlike most previous work, our goal is to leverage a multimodal sensory stream composed of RGB, depth, and thermal (RGB-D-T) signals and use this data stream as an input to a new framework for joint 6 DOF camera localization, 3D reconstruction, and semantic segmentation. As our extensive experimental evaluation shows, the combination of different sensing modalities allows us to achieve greater robustness in situations where both the observer and the objects in the scene move rapidly (a challenging situation for traditional semantic reconstruction methods). Moreover, we contribute a new dataset that includes a large number of egocentric RGB-D-T videos of humans performing daily real-world activities as well as a new demonstration hardware platform for acquiring such a dataset.
Published: 2017
Full Text: View/download PDF

18. Weakly Supervised 3D Reconstruction with Adversarial Constraint

Author: Manmohan Chandraker, Animesh Garg, Silvio Savarese, JunYoung Gwak, and Christopher Choy
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), 3D reconstruction, Pooling, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, CAD, 02 engineering and technology, Real image, Machine learning, computer.software_genre, Backpropagation, law.invention, Constraint (information theory), law, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Scale (map), business, Manifold (fluid mechanics), computer
Abstract: Supervised 3D reconstruction has witnessed significant progress through the use of deep neural networks. However, this increase in performance requires large-scale annotations of 2D/3D data. In this paper, we explore inexpensive 2D supervision as an alternative for expensive 3D CAD annotation. Specifically, we use foreground masks as weak supervision through a raytrace pooling layer that enables perspective projection and backpropagation. Additionally, since the 3D reconstruction from masks is an ill-posed problem, we propose to constrain the 3D reconstruction to the manifold of unlabeled realistic 3D shapes that match mask observations. We demonstrate that learning a log-barrier solution to this constrained optimization problem resembles the GAN objective, enabling the use of existing tools for training GANs. We evaluate and analyze the manifold constrained reconstruction on various datasets for single and multi-view reconstruction of both synthetic and real images.
Published: 2017
Full Text: View/download PDF

19. SEGCloud: Semantic Segmentation of 3D Point Clouds

Author: Iro Armeni, Silvio Savarese, JunYoung Gwak, Christopher Choy, and Lyne P. Tchapmi
Subjects: FOS: Computer and information sciences, Conditional random field, Artificial neural network, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Trilinear interpolation, Point cloud, 020207 software engineering, Pattern recognition, 02 engineering and technology, computer.software_genre, Voxel, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Segmentation, Artificial intelligence, Differentiable function, business, computer
Abstract: 3D semantic scene labeling is fundamental to agents operating in the real world. In particular, labeling raw 3D point sets from sensors provides fine-grained semantics. Recent works leverage the capabilities of Neural Networks (NNs), but are limited to coarse voxel predictions and do not explicitly enforce global consistency. We present SEGCloud, an end-to-end framework to obtain 3D point-level segmentation that combines the advantages of NNs, trilinear interpolation (TI) and fully connected Conditional Random Fields (FC-CRF). Coarse voxel predictions from a 3D Fully Convolutional NN are transferred back to the raw 3D points via trilinear interpolation. Then the FC-CRF enforces global consistency and provides fine-grained semantics on the points. We implement the latter as a differentiable Recurrent NN to allow joint optimization. We evaluate the framework on two indoor and two outdoor 3D datasets (NYU V2, S3DIS, KITTI, Semantic3D.net), and show performance comparable or superior to the state-of-the-art on all datasets. Comment: Accepted as a spotlight at the International Conference of 3D Vision (3DV 2017)
Published: 2017
Full Text: View/download PDF

20. Lattice Long Short-Term Memory for Human Action Recognition

Author: Kevin Chen, Bertram E. Shi, Dit-Yan Yeung, Kui Jia, Lin Sun, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Optical flow, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Recurrent neural network, Gesture recognition, Lattice (order), 0202 electrical engineering, electronic engineering, information engineering, Action recognition, RGB color model, 020201 artificial intelligence & image processing, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities. Comment: ICCV 2017
Published: 2017
Full Text: View/download PDF

21. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies

Author: Alexandre Alahi, Amir Sadeghian, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 020207 software engineering, 02 engineering and technology, Sensor fusion, Tracking (particle physics), Object detection, Term (time), Recurrent neural network, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: The majority of existing solutions to the Multi-Target Tracking (MTT) problem do not combine cues in a coherent end-to-end fashion over a long period of time. However, we present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, we present a structure of Recurrent Neural Networks (RNN) that jointly reasons on multiple cues over a temporal window. We are able to correct many data association errors and recover observations from an occluded state. We demonstrate the robustness of our data-driven approach by tracking multiple targets using their appearance, motion, and even interactions. Our method outperforms previous works on multiple publicly available datasets including the challenging MOT benchmark.
Published: 2017
Full Text: View/download PDF

22. Adversarially Robust Policy Learning: Active construction of physically-plausible perturbations

Author: Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese
Subjects: 0209 industrial biotechnology, Computer science, business.industry, Computation, 02 engineering and technology, Machine learning, computer.software_genre, Domain (software engineering), 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Reinforcement learning, Robot, 020201 artificial intelligence & image processing, Artificial intelligence, State (computer science), Resilience (network), business, computer, Vulnerability (computing)
Abstract: Policy search methods in reinforcement learning have demonstrated success in scaling up to larger problems beyond toy examples. However, deploying these methods on real robots remains challenging due to the large sample complexity required during learning and their vulnerability to malicious intervention. We introduce Adversarially Robust Policy Learning (ARPL), an algorithm that leverages active computation of physically-plausible adversarial examples during training to enable robust policy learning in the source domain and robust performance under both random and adversarial input perturbations. We evaluate ARPL on four continuous control tasks and show superior resilience to changes in physical environment dynamics parameters and environment state as compared to state-of-the-art robust policy learning methods. Code, data, and additional experimental results are available at: stanfordvl.github.io/ARPL
Published: 2017
Full Text: View/download PDF

23. Deep View Morphing

Author: Junghyun Kwon, Max E. McFarland, Dinghuang Ji, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Pixel, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Visibility (geometry), Interpolation (computer graphics), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, View synthesis, Morphing, Rectification, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, 0105 earth and related environmental sciences, Interpolation
Abstract: Recently, convolutional neural networks (CNN) have been successfully applied to view synthesis problems. However, such CNN-based methods can suffer from lack of texture details, shape distortions, or high computational complexity. In this paper, we propose a novel CNN architecture for view synthesis called "Deep View Morphing" that does not suffer from these issues. To synthesize a middle view of two input images, a rectification network first rectifies the two input images. An encoder-decoder network then generates dense correspondences between the rectified images and blending masks to predict the visibility of pixels of the rectified images in the middle view. A view morphing network finally synthesizes the middle view using the dense correspondences and blending masks. We experimentally show the proposed method significantly outperforms the state-of-the-art CNN-based view synthesis method. Accepted to CVPR 2017
Published: 2017
Full Text: View/download PDF

24. Feedback Networks

Author: Amir R. Zamir, Te-Lin Wu, Lin Sun, William B. Shen, Bertram E. Shi, Jitendra Malik, and Silvio Savarese
Published: 2017
Full Text: View/download PDF

25. Unsupervised camera localization in crowded spaces

Author: Silvio Savarese, Li Fei-Fei, Alexandre Alahi, and Judson Wilson
Subjects: Matching (statistics), Social robot, Optimization problem, business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020206 networking & telecommunications, 02 engineering and technology, Tracking (particle physics), Motion (physics), 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Existing camera networks in public spaces such as train terminals or malls can help social robots to navigate crowded scenes. However, the localization of the cameras is required, i.e., the positions and poses of all cameras in a unique reference. In this work, we estimate the relative location of any pair of cameras by solely using noisy trajectories observed from each camera. We propose a fully unsupervised learning technique using unlabelled pedestrian motion patterns captured in crowded scenes. We first estimate the pairwise camera parameters by optimally matching single-view pedestrian tracks using social awareness. Then, we show the impact of jointly estimating the network parameters. This is done by formulating a nonlinear least-squares optimization problem, leveraging a continuous approximation of the matching function. We evaluate our approach in real-world environments such as train terminals, where several hundreds of individuals need to be tracked across dozens of cameras every second.
Published: 2017
Full Text: View/download PDF

26. Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection

Author: Wongun Choi, Silvio Savarese, Yu Xiang, and Yuanqing Lin
Subjects: FOS: Computer and information sciences, Subcategory, 050210 logistics & transportation, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), 05 social sciences, Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, 02 engineering and technology, Object (computer science), 3D pose estimation, Convolutional neural network, Object detection, Object-class detection, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Viola–Jones object detection framework, Computer vision, Artificial intelligence, business, Pose
Abstract: In CNN-based object detection methods, region proposal becomes a bottleneck when objects exhibit significant scale variation, occlusion or truncation. In addition, these methods mainly focus on 2D object detection and cannot estimate detailed properties of objects. In this paper, we propose subcategory-aware CNNs for object detection. We introduce a novel region proposal network that uses subcategory information to guide the proposal generating process, and a new detection network for joint detection and subcategory classification. By using subcategories related to object pose, we achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks. Comment: Published in WACV 2017
Published: 2017
Full Text: View/download PDF

27. Social LSTM: Human Trajectory Prediction in Crowded Spaces

Author: Silvio Savarese, Alexandre Alahi, Vignesh Ramanathan, Kratarth Goel, Li Fei-Fei, and Alexandre Robicquet
Subjects: 050210 logistics & transportation, Sequence, Computer science, business.industry, 05 social sciences, Contrast (statistics), 02 engineering and technology, Motion (physics), Task (project management), Recurrent neural network, 0502 economics and business, Path (graph theory), 0202 electrical engineering, electronic engineering, information engineering, Trajectory, 020201 artificial intelligence & image processing, Artificial intelligence, business
Abstract: Pedestrians follow different trajectories to avoid obstacles and accommodate fellow pedestrians. Any autonomous vehicle navigating such a scene should be able to foresee the future positions of pedestrians and accordingly adjust its path to avoid collisions. This problem of trajectory prediction can be viewed as a sequence generation task, where we are interested in predicting the future trajectory of people based on their past positions. Following the recent success of Recurrent Neural Network (RNN) models for sequence prediction tasks, we propose an LSTM model which can learn general human movement and predict their future trajectories. This is in contrast to traditional approaches which use hand-crafted functions such as Social forces. We demonstrate the performance of our method on several public datasets. Our model outperforms state-of-the-art methods on some of these datasets. We also analyze the trajectories predicted by our model to demonstrate the motion behaviour learned by our model.
Published: 2016
Full Text: View/download PDF

28. DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes

Author: Kevin Chen, Silvio Savarese, Saumitro Dasgupta, and Kuan Fang
Subjects: Cuboid, Artificial neural network, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, Margin (machine learning), Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Clutter, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Vanishing point, business, Projection (set theory), 0105 earth and related environmental sciences
Abstract: We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state-of-the-art results, outperforming previous methods by a wide margin.
Published: 2016
Full Text: View/download PDF

29. 3D Semantic Parsing of Large-Scale Indoor Spaces

Author: Ioannis Brilakis, Silvio Savarese, Ozan Sener, Martin Fischer, Helen Jiang, Amir Roshan Zamir, and Iro Armeni
Subjects: Parsing, business.industry, Coordinate system, Point cloud, 020207 software engineering, 02 engineering and technology, Image segmentation, Notation, computer.software_genre, 4013 Geomatic Engineering, 46 Information and Computing Sciences, Robustness (computer science), Histogram, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Segmentation, Computer vision, Data mining, Artificial intelligence, business, computer, 40 Engineering, Mathematics
Abstract: In this paper, we propose a method for semantic parsing of the 3D point cloud of an entire building using a hierarchical approach: first, the raw data is parsed into semantically meaningful spaces (e.g. rooms, etc.) that are aligned into a canonical reference coordinate system. Second, the spaces are parsed into their structural and building elements (e.g. walls, columns, etc.). Performing these with a strong notion of global 3D space is the backbone of our method. The alignment in the first step injects strong 3D priors from the canonical coordinate system into the second step for discovering elements. This allows diverse challenging scenarios as man-made indoor spaces often show recurrent geometric patterns while the appearance features can change drastically. We also argue that identification of structural elements in indoor spaces is essentially a detection problem, rather than segmentation which is commonly used. We evaluated our method on a new dataset of several buildings with a covered area of over 6,000 m² and over 215 million points, demonstrating robust results readily useful for practical applications.
Published: 2016
Full Text: View/download PDF

30. Robust single-view instance recognition

Author: Sebastian Thrun, David Held, and Silvio Savarese
Subjects: Artificial neural network, Computer science, business.industry, 3D single-object recognition, 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Method, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Robot, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Set (psychology), business, Neural coding, 0105 earth and related environmental sciences
Abstract: Some robots must repeatedly interact with a fixed set of objects in their environment. To operate correctly, it is helpful for the robot to be able to recognize the object instances that it repeatedly encounters. However, current methods for recognizing object instances require that, during training, many pictures are taken of each object from a large number of viewing angles. This procedure is slow and requires much manual effort before the robot can begin to operate in a new environment. We have developed a novel procedure for training a neural network to recognize a set of objects from just a single training image per object. To obtain robustness to changes in viewpoint, we take advantage of a supplementary dataset in which we observe a separate (non-overlapping) set of objects from multiple viewpoints. After pre-training the network in a novel multi-stage fashion, the network can robustly recognize new object instances given just a single training image of each object. If more images of each object are available, the performance improves. We perform a thorough analysis comparing our novel training procedure to traditional neural network pre-training techniques as well as previous state-of-the-art approaches including keypoint-matching, template-matching, and sparse coding, and we demonstrate that our method significantly outperforms these previous approaches. Our method can thus be used to easily teach a robot to recognize a novel set of object instances from unknown viewpoints.
Published: 2016
Full Text: View/download PDF

31. Learning to Track: Online Multi-object Tracking by Decision Making

Author: Yu Xiang, Silvio Savarese, and Alexandre Alahi
Subjects: business.industry, Computer science, Frame (networking), Markov process, Machine learning, computer.software_genre, Object (computer science), Object detection, symbols.namesake, Video tracking, Benchmark (computing), symbols, Reinforcement learning, Artificial intelligence, Markov decision process, business, computer
Abstract: Online Multi-Object Tracking (MOT) has wide applications in time-critical video analysis scenarios, such as robot navigation and autonomous driving. In tracking-by-detection, a major challenge of online MOT is how to robustly associate noisy object detections on a new video frame with previously tracked objects. In this work, we formulate the online MOT problem as decision making in Markov Decision Processes (MDPs), where the lifetime of an object is modeled with an MDP. Learning a similarity function for data association is equivalent to learning a policy for the MDP, and the policy learning is approached in a reinforcement learning fashion which benefits from the advantages of both offline learning and online learning for data association. Moreover, our framework can naturally handle the birth/death and appearance/disappearance of targets by treating them as state transitions in the MDP while leveraging existing online single object tracking methods. We conduct experiments on the MOT Benchmark [24] to verify the effectiveness of our method.
Published: 2015
Full Text: View/download PDF

32. Semantic Cross-View Matching

Author: Francesco A. N. Palmieri, Francesco Castaldo, Amir Roshan Zamir, Roland Angst, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Geographic information system, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, 02 engineering and technology, Image segmentation, 020204 information systems, 11. Sustainability, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Matching cross-view images is challenging because the appearance and viewpoints are significantly different. While low-level features based on gradient orientations or filter responses can vary drastically with such changes in viewpoint, the semantic information of images shows an invariant characteristic in this respect. Consequently, semantically labeled regions can be used for performing cross-view matching. In this paper, we therefore explore this idea and propose an automatic method for detecting and representing the semantic information of an RGB image with the goal of performing cross-view matching with a (non-RGB) geographic information system (GIS). A segmented image forms the input to our system with segments assigned to semantic concepts such as traffic signs, lakes, roads, foliage, etc. We design a descriptor to robustly capture both the presence of semantic concepts and the spatial layout of those segments. Pairwise distances between the descriptors extracted from the GIS map and the query image are then used to generate a shortlist of the most promising locations with similar semantic concepts in a consistent spatial layout. An experimental evaluation with challenging query images and a large urban area shows promising results.
Published: 2015
Full Text: View/download PDF

33. Unsupervised Semantic Parsing of Video Collections

Author: Ashutosh Saxena, Ozan Sener, Silvio Savarese, and Amir Roshan Zamir
Subjects: FOS: Computer and information sciences, Structure (mathematical logic), Parsing, Point (typography), business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), media_common.quotation_subject, Computer Science - Computer Vision and Pattern Recognition, computer.software_genre, Generative model, Semantic computing, Quality (business), Artificial intelligence, Joint (audio engineering), business, computer, Natural language processing, media_common
Abstract: Human communication typically has an underlying structure. This is reflected in the fact that in many user generated videos, a starting point, ending, and certain objective steps between these two can be identified. In this paper, we propose a method for parsing a video into such semantic steps in an unsupervised way. The proposed method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish this using both visual and language cues in a joint generative model. The proposed method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate this method on a large number of complex YouTube videos and show results of unprecedented quality for this intricate and impactful problem.
Published: 2015
Full Text: View/download PDF

34. Watch-n-patch: Unsupervised understanding of actions and relations

Author: Silvio Savarese, Jiemi Zhang, Ashutosh Saxena, and Chenxia Wu
Subjects: Sequence, Focus (computing), business.industry, Computer science, Statistical model, Machine learning, computer.software_genre, Action (philosophy), RGB color model, Segmentation, Artificial intelligence, business, Set (psychology), computer
Abstract: We focus on modeling human activities comprising multiple actions in a completely unsupervised setting. Our model learns the high-level action co-occurrence and temporal relations between the actions in the activity video. We consider the video as a sequence of short-term action clips, called action-words, and an activity is about a set of action-topics indicating which actions are present in the video. Then we propose a new probabilistic model relating the action-words and the action-topics. It allows us to model long-range action relations that commonly exist in the complex activity, which is challenging to capture in the previous works. We apply our model to unsupervised action segmentation and recognition, and also to a novel application that detects forgotten actions, which we call action patching. For evaluation, we also contribute a new challenging RGB-D activity video dataset recorded by the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacted with different objects. The extensive experiments show the effectiveness of our model.
Published: 2015
Full Text: View/download PDF

35. Enriching object detection with 2D-3D registration and continuous viewpoint estimation

Author: Sam Corbett-Davies, Silvio Savarese, Christopher Choy, and Michael Stark
Subjects: Discriminative model, Computer science, business.industry, Computer vision, Artificial intelligence, Object (computer science), business, Pose, Object detection
Abstract: A large body of recent work on object detection has focused on exploiting 3D CAD model databases to improve detection performance. Many of these approaches work by aligning exact 3D models to images using templates generated from renderings of the 3D models at a set of discrete viewpoints. However, the training procedures for these approaches are computationally expensive and require gigabytes of memory and storage, while the viewpoint discretization hampers pose estimation performance. We propose an efficient method for synthesizing templates from 3D models that runs on the fly - that is, it quickly produces detectors for an arbitrary viewpoint of a 3D model without expensive dataset-dependent training or template storage. Given a 3D model and an arbitrary continuous detection viewpoint, our method synthesizes a discriminative template by extracting features from a rendered view of the object and decorrelating spatial dependences among the features. Our decorrelation procedure relies on a gradient-based algorithm that is more numerically stable than standard decomposition-based procedures, and we efficiently search for candidate detections by computing FFT-based template convolutions. Due to the speed of our template synthesis procedure, we are able to perform joint optimization of scale, translation, continuous rotation, and focal length using the Metropolis-Hastings algorithm. We provide an efficient GPU implementation of our algorithm, and we validate its performance on the 3D Object Classes and PASCAL3D+ datasets.
Published: 2015
Full Text: View/download PDF

36. Data-driven 3D Voxel Patterns for object category recognition

Author: Silvio Savarese, Yu Xiang, Wongun Choi, and Yuanqing Lin
Subjects: Computer science, business.industry, Cognitive neuroscience of visual object recognition, Pattern recognition, computer.software_genre, 3D pose estimation, Object-class detection, Voxel, Segmentation, Computer vision, Viola–Jones object detection framework, Artificial intelligence, Representation (mathematics), business, Pose, computer
Abstract: Despite the great progress achieved in recognizing objects as 2D bounding boxes in images, it is still very challenging to detect occluded objects and estimate the 3D properties of multiple objects from a single image. In this paper, we propose a novel object representation, 3D Voxel Pattern (3DVP), that jointly encodes the key properties of objects including appearance, 3D shape, viewpoint, occlusion and truncation. We discover 3DVPs in a data-driven way, and train a bank of specialized detectors for a dictionary of 3DVPs. The 3DVP detectors are capable of detecting objects with specific visibility patterns and transferring the meta-data from the 3DVPs to the detected objects, such as 2D segmentation mask, 3D pose as well as occlusion or truncation boundaries. The transferred meta-data allows us to infer the occlusion relationship among objects, which in turn provides improved object recognition results. Experiments are conducted on the KITTI detection benchmark [17] and the outdoor-scene dataset [41]. We improve state-of-the-art results on car detection and pose estimation with notable margins (6% in difficult data of KITTI). We also verify the ability of our method in accurately segmenting objects from the background and localizing them in 3D.
Published: 2015
Full Text: View/download PDF

37. Learning an Image-Based Motion Context for Multiple People Tracking

Author: Silvio Savarese, Alina Kuznetsova, Laura Leal-Taixé, Michele Fenzi, and Bodo Rosenhahn
Subjects: Feature (computer vision), business.industry, Social force model, Context (language use), Computer vision, Artificial intelligence, Representation (mathematics), business, Tracking (particle physics), Motion (physics), Mathematics, Random forest, Feature detection (computer vision)
Abstract: We present a novel method for multiple people tracking that leverages a generalized model for capturing interactions among individuals. At the core of our model lies a learned dictionary of interaction feature strings which capture relationships between the motions of targets. These feature strings, created from low-level image features, lead to a much richer representation of the physical interactions between targets compared to hand-specified social force models that previous works have introduced for tracking. One disadvantage of using social forces is that all pedestrians must be detected in order for the forces to be applied, while our method is able to encode the effect of undetected targets, making the tracker more robust to partial occlusions. The interaction feature strings are used in a Random Forest framework to track targets according to the features surrounding them. Results on six publicly available sequences show that our method outperforms state-of-the-art approaches in multiple people tracking.
Published: 2014
Full Text: View/download PDF

38. Beyond PASCAL: A benchmark for 3D object detection in the wild

Author: Silvio Savarese, Roozbeh Mottaghi, and Yu Xiang
Subjects: 2d images, Computer science, business.industry, Testbed, Solid modeling, Pascal (programming language), Object detection, Computer vision, Electronic design automation, Artificial intelligence, business, Pose, computer, computer.programming_language
Abstract: 3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small amount of images per category or are captured in controlled environments. In this paper, we contribute PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d
Published: 2014
Full Text: View/download PDF

39. Understanding the 3D layout of a cluttered room from multiple images

Author: Li Fei-Fei, Axel Furlan, Sid Yingze Bao, and Silvio Savarese
Subjects: Structure (mathematical logic), business.industry, Computer science, Cognitive neuroscience of visual object recognition, Image segmentation, Viewpoints, Semantics, ING-INF/05 - Information Processing Systems, Scene understanding, indoor, clutter, moving camera, robotics, computer vision, Image (mathematics), Computer vision, Ceiling (aeronautics), Augmented reality, Artificial intelligence, business
Abstract: We present a novel framework for robustly understanding the geometrical and semantic structure of a cluttered room from a small number of images captured from different viewpoints. The tasks we seek to address include: i) estimating the 3D layout of the room - that is, the 3D configuration of floor, walls and ceiling; ii) identifying and localizing all the foreground objects in the room. We jointly use multiview geometry constraints and image appearance to identify the best room layout configuration. Extensive experimental evaluation demonstrates that our estimation results are more complete and accurate in estimating 3D room structure and recognizing objects than alternative state-of-the-art algorithms. In addition, we show an augmented reality mobile application to highlight the high accuracy of our method, which may be beneficial to many computer vision applications.
Published: 2014
Full Text: View/download PDF

40. Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies

Author: Silvio Savarese, Min Sun, and Wan Huang
Subjects: Support vector machine, Hierarchy, Contextual image classification, Branch and bound, Structured support vector machine, business.industry, Pattern recognition, Artificial intelligence, Greedy algorithm, business, Greedy randomized adaptive search procedure, Hierarchical classifier, Mathematics
Abstract: Many methods have been proposed to solve the image classification problem for a large number of categories. Among them, methods based on tree-based representations achieve a good trade-off between accuracy and test time efficiency. While focusing on learning a tree-shaped hierarchy and the corresponding set of classifiers, most of them [11, 2, 14] use a greedy prediction algorithm for test time efficiency. We argue that the dramatic decrease in accuracy at high efficiency is caused by the specific design choice of the learning and greedy prediction algorithms. In this work, we propose a classifier which achieves a better trade-off between efficiency and accuracy with a given tree-shaped hierarchy. First, we cast the classification problem as finding the best path in the hierarchy, and a novel branch-and-bound-like algorithm is introduced to efficiently search for the best path. Second, we jointly train the classifiers using a novel Structured SVM (SSVM) formulation with additional bound constraints. As a result, our method achieves a significant 4.65%, 5.43%, and 4.07% (relative 24.82%, 41.64%, and 109.79%) improvement in accuracy at high efficiency compared to state-of-the-art greedy "tree-based" methods [14] on the Caltech-256 [15], SUN [32] and ImageNet 1K [9] datasets, respectively. Finally, we show that our branch-and-bound-like algorithm naturally ranks the paths in the hierarchy (Fig. 8) so that users can further process them.
Published: 2013
Full Text: View/download PDF

41. 3D Scene Understanding by Voxel-CRF

Author: Silvio Savarese, Pushmeet Kohli, and Byung-soo Kim
Subjects: Conditional random field, business.industry, Computer science, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Iterative reconstruction, computer.software_genre, Object (computer science), Voxel, RGB color model, Computer vision, Artificial intelligence, business, computer, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such model allows to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in presence of partial occlusions using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
Published: 2013
Full Text: View/download PDF

42. Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses

Author: Wongun Choi, Ryan Tokola, and Silvio Savarese
Subjects: business.industry, Markov model, Machine learning, computer.software_genre, Consistency (database systems), Feature (computer vision), Video tracking, Path (graph theory), Markov property, Artificial intelligence, Hidden Markov model, business, computer, Pose, Mathematics
Abstract: We present an approach to multi-target tracking that has expressive potential beyond the capabilities of chain-shaped hidden Markov models, yet has significantly reduced complexity. Our framework, which we call tracking-by-selection, is similar to tracking-by-detection in that it separates the tasks of detection and tracking, but it shifts temporal reasoning from the tracking stage to the detection stage. The core feature of tracking-by-selection is that it reasons about path hypotheses that traverse the entire video instead of a chain of single-frame object hypotheses. A traditional chain-shaped tracking-by-detection model is only able to promote consistency between one frame and the next. In tracking-by-selection, path hypotheses exist across time, and encouraging long-term temporal consistency is as simple as rewarding path hypotheses with consistent image features. One additional advantage of tracking-by-selection is that it results in a dramatically simplified model that can be solved exactly. We adapt an existing tracking-by-detection model to the tracking-by-selection framework, and show improved performance on a challenging dataset.
Published: 2013
Full Text: View/download PDF

43. EVA: An efficient vision architecture for mobile systems

Author: Jason Clemons, Andrea Pellegrini, Silvio Savarese, and Todd Austin
Published: 2013
Full Text: View/download PDF

44. Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Caroline Pantofaru, Yu-Wei Chao, Silvio Savarese, and Wongun Choi
Subjects: Phrase, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scene statistics, Pattern recognition, Solid modeling, Object (computer science), Semantics, Object detection, Core (graph theory), Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
Published: 2013
Full Text: View/download PDF

45. Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines

Author: Silvio Savarese, Honglak Lee, Benjamin Kuipers, and Roni Mittelman
Subjects: Restricted Boltzmann machine, Class (computer programming), Computer science, business.industry, Process (engineering), Supervised learning, Boltzmann machine, Cognitive neuroscience of visual object recognition, Semantics, Machine learning, computer.software_genre, Feature (machine learning), Artificial intelligence, business, computer
Abstract: The use of semantic attributes in computer vision problems has been gaining increased popularity in recent years. Attributes provide an intermediate feature representation in between low-level features and the class categories, and offer several attractive properties, among which are improved learning of novel categories based on few examples, as well as allowing for zero-shot learning. However, the major caveat is that learning semantic attributes is a laborious task, requiring a significant amount of time and human intervention to provide labels. In order to address this issue, we propose a weakly supervised approach to learn mid-level features, where the only supervision is provided by the category classes of the training examples. We develop a novel extension of the restricted Boltzmann machine (RBM) with Beta-Bernoulli process priors. Unlike the standard RBM, our model uses the class labels to promote more efficient sharing of information by different categories. This tends to improve the generalization performance. By using semantic attributes for which annotations are available, we show that we can find correspondences between the mid-level features that we learn and the labeled attributes. Therefore, the mid-level features have distinct semantic characterization which is very similar to that given by the semantic attributes, even though their labeling was not used during the training process. Our experimental results in object recognition tasks show significant performance gains, outperforming methods which rely on manually labeled semantic attributes.
Published: 2013
Full Text: View/download PDF

46. Dense Object Reconstruction with Semantic Priors

Author: Yuanqing Lin, Manmohan Chandraker, Silvio Savarese, and Sid Yingze Bao
Subjects: business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, Pattern recognition, Iterative reconstruction, Object (computer science), Object detection, Set (abstract data type), Active shape model, Prior probability, Structure from motion, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS, Mathematics
Abstract: We present a dense reconstruction approach that overcomes the drawbacks of traditional multiview stereo by incorporating semantic information in the form of learned category-level shape priors and object detection. Given training data comprised of 3D scans and images of objects from various viewpoints, we learn a prior comprised of a mean shape and a set of weighted anchor points. The former captures the commonality of shapes across the category, while the latter encodes similarities between instances in the form of appearance and spatial consistency. We propose robust algorithms to match anchor points across instances that enable learning a mean shape for the category, even with large shape variations across instances. We model the shape of an object instance as a warped version of the category mean, along with instance-specific details. Given multiple images of an unseen instance, we collate information from 2D object detectors to align the structure from motion point cloud with the mean shape, which is subsequently warped and refined to approach the actual shape. Extensive experiments demonstrate that our model is general enough to learn semantic priors for different object categories, yet powerful enough to reconstruct individual shapes with large variations. Qualitative and quantitative evaluations show that our framework can produce more accurate reconstructions than alternative state-of-the-art multiview stereo systems.
Published: 2013
Full Text: View/download PDF

47. Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

Author: Silvio Savarese, Byung-soo Kim, and Shili Xu
Subjects: Computer science, Segmentation-based object categorization, business.industry, 3D single-object recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, Pattern recognition, Image segmentation, Object detection, Support vector machine, Object-class detection, RGB color model, Viola–Jones object detection framework, Computer vision, Segmentation, Artificial intelligence, business
Abstract: In this paper we focus on the problem of detecting objects in 3D from RGB-D images. We propose a novel framework that explores the compatibility between segmentation hypotheses of the object in the image and the corresponding 3D map. Our framework allows us to discover the optimal location of the object using a generalization of the structural latent SVM formulation in 3D as well as the definition of a new loss function defined over the 3D space in training. We evaluate our method using two existing RGB-D datasets. Extensive quantitative and qualitative experimental results show that our proposed approach outperforms state-of-the-art methods as well as a number of baseline approaches for both 3D and 2D object recognition tasks.
Published: 2013
Full Text: View/download PDF

48. Toward mutual information based automatic registration of 3D point clouds

Author: James R. McBride, Silvio Savarese, Gaurav Pandey, and Ryan M. Eustice
Subjects: Training set, Computer science, business.industry, Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, Image registration, Pattern recognition, Mutual information, Omnidirectional camera, Robustness (computer science), Histogram, Computer vision, Artificial intelligence, business, Cluster analysis
Abstract: This paper reports a novel mutual information (MI) based algorithm for automatic registration of unstructured 3D point clouds comprised of co-registered 3D lidar and camera imagery. The proposed method provides a robust and principled framework for fusing the complementary information obtained from these two different sensing modalities. High-dimensional features are extracted from a training set of textured point clouds (scans) and hierarchical k-means clustering is used to quantize these features into a set of codewords. Using this codebook, any new scan can be represented as a collection of codewords. Under the correct rigid-body transformation aligning two overlapping scans, the MI between the codewords present in the scans is maximized. We apply a James-Stein-type shrinkage estimator to estimate the true MI from the marginal and joint histograms of the codewords extracted from the scans. Experimental results using scans obtained by a vehicle equipped with a 3D laser scanner and an omnidirectional camera are used to validate the robustness of the proposed algorithm over a wide range of initial conditions. We also show that the proposed method works well with 3D data alone.
Published: 2012
Full Text: View/download PDF

49. An efficient branch-and-bound algorithm for optimal human pose estimation

Author: Murali Telaprolu, Min Sun, Silvio Savarese, and Honglak Lee
Subjects: Tree (data structure), Branch and bound, Computer science, business.industry, Inference, Pattern recognition, Graphical model, Artificial intelligence, Star (graph theory), 3D pose estimation, business, Pose, Decision tree model
Abstract: Human pose estimation in a static image is a challenging problem in computer vision in that body part configurations are often subject to severe deformations and occlusions. Moreover, efficient pose estimation is often a desirable requirement in many applications. The trade-off between accuracy and efficiency has been explored in a large number of approaches. On the one hand, models with simple representations (like tree or star models) can be efficiently applied in pose estimation problems. However, these models are often prone to body part misclassification errors. On the other hand, models with rich representations (i.e., loopy graphical models) are theoretically more robust, but their inference complexity may increase dramatically. In this work, we propose an efficient and exact inference algorithm based on branch-and-bound to solve the human pose estimation problem on loopy graphical models. We show that our method is empirically much faster (about 74 times) than the state-of-the-art exact inference algorithm [21]. By extending a state-of-the-art tree model [16] to a loopy graphical model, we show that the estimation accuracy improves for most of the body parts (especially lower arms) on popular datasets such as Buffy [7] and Stickmen [5] datasets. Finally, our method can be used to exactly solve most of the inference problems on Stretchable Models [18] (which contains a few hundreds of variables) in just a few minutes.
Published: 2012
Full Text: View/download PDF

50. Semantic structure from motion with points, regions, and objects

Author: Silvio Savarese, Sid Yingze Bao, Yu-Wei Chao, and Mohit Bagra
Subjects: Robustness (computer science), business.industry, Computer science, 3D single-object recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, Structure from motion, Computer vision, Pattern recognition, Artificial intelligence, 3D pose estimation, business, Pose
Abstract: Structure from motion (SFM) aims at jointly recovering the structure of a scene as a collection of 3D points and estimating the camera poses from a number of input images. In this paper we generalize this concept: not only do we want to recover 3D points, but also recognize and estimate the location of high level semantic scene components such as regions and objects in 3D. As a key ingredient for this joint inference problem, we seek to model various types of interactions between scene components. Such interactions help regularize our solution and obtain more accurate results than solving these problems in isolation. Experiments on public datasets demonstrate that: 1) our framework estimates camera poses more robustly than SFM algorithms that use points only; 2) our framework is capable of accurately estimating pose and location of objects, regions, and points in the 3D scene; 3) our framework recognizes objects and regions more accurately than state-of-the-art single image recognition methods.
Published: 2012
Full Text: View/download PDF

71 results on '"Silvio Savarese"'
